WEBVTT X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000 00:00:00.000 --> 00:00:03.493 [MUSIC PLAYING] 00:00:17.873 --> 00:00:21.040 SPEAKER 1: OK, welcome back, everyone, to our final topic in an introduction 00:00:21.040 --> 00:00:23.050 to artificial intelligence with Python. 00:00:23.050 --> 00:00:25.390 And today, the topic is language. 00:00:25.390 --> 00:00:27.280 So thus far in the class, we've seen a number 00:00:27.280 --> 00:00:30.700 of different ways of interacting with AI, artificial intelligence, 00:00:30.700 --> 00:00:34.690 but it's mostly been happening in the way of us formulating problems 00:00:34.690 --> 00:00:38.320 in ways that AI can understand-- learning to speak the language of AI, 00:00:38.320 --> 00:00:41.800 so to speak, by trying to take a problem and formulate it as a search problem, 00:00:41.800 --> 00:00:45.160 or by trying to take a problem and make it a constraint satisfaction problem-- 00:00:45.160 --> 00:00:47.800 something that our AI is able to understand. 00:00:47.800 --> 00:00:50.800 Today, we're going to try and come up with algorithms and ideas that 00:00:50.800 --> 00:00:53.170 allow our AI to meet us halfway, so to speak-- 00:00:53.170 --> 00:00:56.770 to be able to allow AI to be able to understand, and interpret, and get 00:00:56.770 --> 00:00:58.915 some sort of meaning out of human language-- 00:00:58.915 --> 00:01:00.790 the type of language, the spoken language, 00:01:00.790 --> 00:01:03.760 like English, or some other language that we naturally speak. 00:01:03.760 --> 00:01:06.700 And this turns out to be a really challenging task for AI. 00:01:06.700 --> 00:01:09.850 And it really encompasses a number of different types of tasks 00:01:09.850 --> 00:01:13.210 all under the broad heading of natural language processing, 00:01:13.210 --> 00:01:15.190 the idea of coming up with algorithms that 00:01:15.190 --> 00:01:19.910 allow our AI to be able to process and understand natural language. 
00:01:19.910 --> 00:01:22.000 So these tasks vary in terms of the types of tasks 00:01:22.000 --> 00:01:24.490 we might want an AI to perform, and therefore, the types of 00:01:24.490 --> 00:01:25.698 algorithms that we might use. 00:01:25.698 --> 00:01:28.030 But some common tasks that you might see 00:01:28.030 --> 00:01:30.250 are things like automatic summarization. 00:01:30.250 --> 00:01:33.520 You give an AI a long document, and you would like for the AI 00:01:33.520 --> 00:01:35.680 to be able to summarize it, come up with a shorter 00:01:35.680 --> 00:01:39.850 representation of the same idea, but still in some kind of natural language, 00:01:39.850 --> 00:01:40.780 like English. 00:01:40.780 --> 00:01:44.740 Something like information extraction-- given a whole corpus of information 00:01:44.740 --> 00:01:46.750 in some body of documents or on the internet, 00:01:46.750 --> 00:01:49.840 for example, we'd like for our AI to be able to extract 00:01:49.840 --> 00:01:54.070 some sort of meaningful semantic information out of all of that content 00:01:54.070 --> 00:01:56.470 that it's able to look at and read. 00:01:56.470 --> 00:01:59.020 Language identification-- the task of, given a page, 00:01:59.020 --> 00:02:01.562 can you figure out what language that document is written in? 00:02:01.562 --> 00:02:04.520 This is the type of thing you might see if you use a web browser where, 00:02:04.520 --> 00:02:06.280 if you open up a page in another language, 00:02:06.280 --> 00:02:09.400 that web browser might ask you, oh, I think it's in this language-- would 00:02:09.400 --> 00:02:12.070 you like me to translate it into English for you, for example? 
00:02:12.070 --> 00:02:15.070 And that language identification process is a task 00:02:15.070 --> 00:02:17.800 that our AI needs to be able to do, which is then related 00:02:17.800 --> 00:02:21.550 to machine translation, the process of taking text in one language 00:02:21.550 --> 00:02:24.190 and translating it into another language-- which there's 00:02:24.190 --> 00:02:26.710 been a lot of research and development on really 00:02:26.710 --> 00:02:28.490 over the course of the last several years. 00:02:28.490 --> 00:02:30.323 And it keeps getting better, in terms of how 00:02:30.323 --> 00:02:33.010 it is that AI is able to take text in one language 00:02:33.010 --> 00:02:37.010 and transform that text into another language as well. 00:02:37.010 --> 00:02:40.330 In addition to that, we have topics like named entity recognition. 00:02:40.330 --> 00:02:43.840 Given some sequence of text, can you pick out what the named entities are? 00:02:43.840 --> 00:02:46.300 These are names of companies, or names of people, 00:02:46.300 --> 00:02:50.050 or names of locations for example, which are often relevant or important parts 00:02:50.050 --> 00:02:51.580 of a particular document. 00:02:51.580 --> 00:02:55.720 Speech recognition is a related task, not to do with the text that is written, 00:02:55.720 --> 00:02:58.840 but text that is spoken-- being able to process audio and figure out, 00:02:58.840 --> 00:03:01.070 what are the actual words that are spoken there? 00:03:01.070 --> 00:03:04.180 And if you think about smart home devices, like Siri or Alexa, 00:03:04.180 --> 00:03:06.370 for example, these are all devices that are now 00:03:06.370 --> 00:03:09.460 able to listen to us when we speak, figure out 00:03:09.460 --> 00:03:13.190 what words we are saying, and draw some sort of meaning out of that as well. 
00:03:13.190 --> 00:03:15.398 We've talked about how you could formulate something, 00:03:15.398 --> 00:03:17.860 for instance, as a hidden Markov model to be able to draw 00:03:17.860 --> 00:03:19.250 those sorts of conclusions. 00:03:19.250 --> 00:03:22.150 Text classification, more generally, is a broad category 00:03:22.150 --> 00:03:25.090 of types of ideas, whenever we want to take some kind of text 00:03:25.090 --> 00:03:27.010 and put it into some sort of category. 00:03:27.010 --> 00:03:29.440 And we've seen these classification type problems 00:03:29.440 --> 00:03:31.930 and how we can use statistical machine learning approaches 00:03:31.930 --> 00:03:32.983 to be able to solve them. 00:03:32.983 --> 00:03:35.650 We'll be able to do something very similar with natural language, 00:03:35.650 --> 00:03:38.910 though we may need to make a couple of adjustments that we'll see soon. 00:03:38.910 --> 00:03:41.500 And then something like word sense disambiguation, 00:03:41.500 --> 00:03:45.010 the idea that, unlike in the language of numbers, 00:03:45.010 --> 00:03:48.520 where AI has very precise representations of everything, words 00:03:48.520 --> 00:03:50.980 are a little bit fuzzy, in terms of their meaning, 00:03:50.980 --> 00:03:52.980 and words can have multiple different meanings-- 00:03:52.980 --> 00:03:55.180 and natural language is inherently ambiguous, 00:03:55.180 --> 00:03:58.360 and we'll take a look at some of those ambiguities in due time today. 00:03:58.360 --> 00:04:00.760 But one challenging task, if you want an AI 00:04:00.760 --> 00:04:02.950 to be able to understand natural language, 00:04:02.950 --> 00:04:05.860 is being able to disambiguate or differentiate 00:04:05.860 --> 00:04:08.080 between different possible meanings of words. 
00:04:08.080 --> 00:04:12.050 If I say a sentence like, I went to the bank, you need to figure out, 00:04:12.050 --> 00:04:14.680 do I mean the bank where I deposit and withdraw money or do 00:04:14.680 --> 00:04:16.240 I mean the bank like the river bank? 00:04:16.240 --> 00:04:18.250 And different words can have different meanings 00:04:18.250 --> 00:04:19.260 that we might want to figure out. 00:04:19.260 --> 00:04:21.519 And based on the context in which a word appears-- 00:04:21.519 --> 00:04:23.890 the wider sentence, or paragraph, or paper 00:04:23.890 --> 00:04:25.630 in which a particular word appears-- 00:04:25.630 --> 00:04:27.880 that might help to inform how it is that we 00:04:27.880 --> 00:04:31.390 disambiguate between different meanings or different senses 00:04:31.390 --> 00:04:32.430 that a word might have. 00:04:32.430 --> 00:04:35.527 And there are many other topics within natural language processing, 00:04:35.527 --> 00:04:37.360 many other algorithms that have been devised 00:04:37.360 --> 00:04:40.190 in order to deal with and address these sorts of problems. 00:04:40.190 --> 00:04:42.607 And today, we're really just going to scratch the surface, 00:04:42.607 --> 00:04:46.240 looking at some of the fundamental ideas that are behind many of these tasks 00:04:46.240 --> 00:04:49.750 within natural language processing, within this idea of trying to come up 00:04:49.750 --> 00:04:53.800 with AI algorithms that are able to do something meaningful with the languages 00:04:53.800 --> 00:04:55.780 that we speak every day. 00:04:55.780 --> 00:04:58.480 And so to introduce this idea, when we think about language, 00:04:58.480 --> 00:05:01.160 we can often think about it in a couple of different parts. 00:05:01.160 --> 00:05:04.520 The first part refers to the syntax of language. 00:05:04.520 --> 00:05:07.630 This is more to do with just the structure of language 00:05:07.630 --> 00:05:09.830 and how it is that that structure works. 
00:05:09.830 --> 00:05:13.060 And if you think about natural language, syntax is one of those things 00:05:13.060 --> 00:05:15.160 that, if you're a native speaker of a language, 00:05:15.160 --> 00:05:16.570 it comes pretty readily to you. 00:05:16.570 --> 00:05:18.320 You don't have to think too much about it. 00:05:18.320 --> 00:05:21.600 If I give you a sentence from Sir Arthur Conan Doyle's Sherlock Holmes, 00:05:21.600 --> 00:05:23.190 for example, a sentence like this-- 00:05:23.190 --> 00:05:27.225 "just before 9 o'clock, Sherlock Holmes stepped briskly into the room"-- 00:05:27.225 --> 00:05:29.100 I think we could probably all agree that this 00:05:29.100 --> 00:05:31.830 is a well-formed grammatical sentence. 00:05:31.830 --> 00:05:34.920 Syntactically, it makes sense, in terms of the way 00:05:34.920 --> 00:05:37.232 that this particular sentence is structured. 00:05:37.232 --> 00:05:40.440 And syntax applies not just to natural language, but to programming languages 00:05:40.440 --> 00:05:40.940 as well. 00:05:40.940 --> 00:05:44.430 If you've ever seen a syntax error in a program that you've written, 00:05:44.430 --> 00:05:47.280 it's likely because you wrote some sort of program 00:05:47.280 --> 00:05:49.470 that was not syntactically well-formed. 00:05:49.470 --> 00:05:52.080 The structure of it was not a valid program. 00:05:52.080 --> 00:05:54.780 In the same way, we can look at English sentences, or sentences 00:05:54.780 --> 00:05:57.600 in any natural language, and make the same kinds of judgments. 00:05:57.600 --> 00:06:01.290 I can say that this sentence is syntactically well-formed. 00:06:01.290 --> 00:06:04.260 When all the parts are put together, all these words are in this order, 00:06:04.260 --> 00:06:08.250 it constructs a grammatical sentence, or a sentence that most people would agree 00:06:08.250 --> 00:06:09.720 is grammatical. 00:06:09.720 --> 00:06:11.970 But there are also grammatically ill-formed sentences. 
00:06:11.970 --> 00:06:14.370 A sentence like, "just before Sherlock Holmes 00:06:14.370 --> 00:06:16.518 9 o'clock stepped briskly the room"-- 00:06:16.518 --> 00:06:19.560 well, I think we would all agree that this is not a well-formed sentence. 00:06:19.560 --> 00:06:22.290 Syntactically, it doesn't make sense. 00:06:22.290 --> 00:06:25.290 And this is the type of thing that, if we want our AI, for example, 00:06:25.290 --> 00:06:27.330 to be able to generate natural language-- 00:06:27.330 --> 00:06:30.250 to be able to speak to us the way a chat bot would speak to us, 00:06:30.250 --> 00:06:31.010 for example-- 00:06:31.010 --> 00:06:34.260 well then our AI is going to need to be able to know this distinction somehow, 00:06:34.260 --> 00:06:37.980 is going to need to know what kinds of sentences are grammatical, 00:06:37.980 --> 00:06:39.330 what kinds of sentences are not. 00:06:39.330 --> 00:06:42.930 And we might come up with rules or ways to statistically learn these ideas, 00:06:42.930 --> 00:06:45.840 and we'll talk about some of those methods as well. 00:06:45.840 --> 00:06:47.910 Syntax can also be ambiguous. 00:06:47.910 --> 00:06:50.970 It's not just that some sentences are well-formed and others are not-- 00:06:50.970 --> 00:06:54.180 there are certain ways that you could take a sentence 00:06:54.180 --> 00:06:58.260 and potentially construct multiple different structures for that sentence. 00:06:58.260 --> 00:07:01.830 A sentence like, "I saw the man on the mountain with a telescope," well, 00:07:01.830 --> 00:07:05.080 this is grammatically well-formed-- syntactically, it makes sense-- 00:07:05.080 --> 00:07:07.350 but what is the structure of the sentence? 00:07:07.350 --> 00:07:10.680 Is it the man on the mountain who has the telescope, or am 00:07:10.680 --> 00:07:13.860 I seeing the man on the mountain and I am using the telescope in order 00:07:13.860 --> 00:07:15.270 to see the man on the mountain? 
00:07:15.270 --> 00:07:19.050 There's some interesting ambiguity here, where it could have potentially 00:07:19.050 --> 00:07:21.090 two different types of structures. 00:07:21.090 --> 00:07:23.940 And this is one of the ideas that we'll come back to also, 00:07:23.940 --> 00:07:27.690 in terms of how to think about dealing with AI when natural language is 00:07:27.690 --> 00:07:29.820 inherently ambiguous. 00:07:29.820 --> 00:07:32.070 So that then is syntax, the structure of language, 00:07:32.070 --> 00:07:34.080 and getting an understanding for how it is 00:07:34.080 --> 00:07:36.330 that, depending on the order and placement of words, 00:07:36.330 --> 00:07:38.910 we can come up with different structures for language. 00:07:38.910 --> 00:07:42.300 But in addition to language having structure, language also has meaning. 00:07:42.300 --> 00:07:44.700 And now we get into the world of semantics, the idea of, 00:07:44.700 --> 00:07:47.190 what it is that a word, or a sequence of words, 00:07:47.190 --> 00:07:51.200 or a sentence, or an entire essay actually means? 00:07:51.200 --> 00:07:54.300 And so a sentence like, "just before 9:00, Sherlock Holmes 00:07:54.300 --> 00:07:58.230 stepped briskly into the room," is a different sentence 00:07:58.230 --> 00:08:01.860 from a sentence like, "Sherlock Holmes stepped briskly into the room just 00:08:01.860 --> 00:08:03.300 before 9:00." 00:08:03.300 --> 00:08:06.480 And yet they have effectively the same meaning. 00:08:06.480 --> 00:08:08.430 They're different sentences, so an AI reading 00:08:08.430 --> 00:08:11.550 them would recognize them as different, but we as humans 00:08:11.550 --> 00:08:13.650 can look at both the sentences and say, yeah, 00:08:13.650 --> 00:08:15.295 they mean basically the same thing. 00:08:15.295 --> 00:08:18.420 And maybe, in this case, it was just because I moved the order of the words 00:08:18.420 --> 00:08:18.920 around. 
00:08:18.920 --> 00:08:21.520 Originally, 9 o'clock was near the beginning of the sentence. 00:08:21.520 --> 00:08:23.700 Now 9 o'clock is near the end of the sentence. 00:08:23.700 --> 00:08:26.950 But you might imagine that I could come up with a different sentence entirely, 00:08:26.950 --> 00:08:29.670 a sentence like, "a few minutes before 9:00, Sherlock Holmes 00:08:29.670 --> 00:08:31.820 walked quickly into the room." 00:08:31.820 --> 00:08:34.650 And OK, that also has a very similar meaning, 00:08:34.650 --> 00:08:37.799 but I'm using different words in order to express that idea. 00:08:37.799 --> 00:08:40.230 And ideally, AI would be able to recognize 00:08:40.230 --> 00:08:43.230 that these two sentences, these different sets of words that 00:08:43.230 --> 00:08:46.020 are similar to each other, have similar meanings, 00:08:46.020 --> 00:08:49.090 and to be able to get at that idea as well. 00:08:49.090 --> 00:08:52.350 Then there are also ways that a syntactically well-formed sentence 00:08:52.350 --> 00:08:54.150 might not mean anything at all. 00:08:54.150 --> 00:08:57.360 A famous example from linguist Noam Chomsky is this sentence here-- 00:08:57.360 --> 00:09:00.570 "colorless green ideas sleep furiously." 00:09:00.570 --> 00:09:03.660 Syntactically, that sentence is perfectly fine. 00:09:03.660 --> 00:09:07.080 Colorless and green are adjectives that modify the noun ideas. 00:09:07.080 --> 00:09:08.010 Sleep is a verb. 00:09:08.010 --> 00:09:09.240 Furiously is an adverb. 00:09:09.240 --> 00:09:12.900 These are correct constructions, in terms of the order of words, 00:09:12.900 --> 00:09:15.150 but it turns out this sentence is meaningless. 00:09:15.150 --> 00:09:18.270 If you tried to ascribe meaning to the sentence, what does it mean? 00:09:18.270 --> 00:09:20.250 And it's not easy to be able to determine 00:09:20.250 --> 00:09:21.660 what it is that it might mean. 
00:09:21.660 --> 00:09:25.355 Semantics itself can also be ambiguous, given that different structures can 00:09:25.355 --> 00:09:26.730 have different types of meanings. 00:09:26.730 --> 00:09:29.110 Different words can have different kinds of meanings, 00:09:29.110 --> 00:09:31.290 so the same sentence with the same structure 00:09:31.290 --> 00:09:33.300 might end up meaning different types of things. 00:09:33.300 --> 00:09:35.880 So my favorite example is 00:09:35.880 --> 00:09:39.570 a headline that was in the Los Angeles Times a little while back. 00:09:39.570 --> 00:09:43.410 The headline says, "Big rig carrying fruit crashes on 210 freeway, 00:09:43.410 --> 00:09:44.633 creates jam." 00:09:44.633 --> 00:09:46.800 So depending on how it is you look at the sentence-- 00:09:46.800 --> 00:09:50.440 how you interpret the sentence-- it can have multiple different meanings. 00:09:50.440 --> 00:09:53.730 And so here too are challenges in this world of natural language processing, 00:09:53.730 --> 00:09:56.640 being able to understand both the syntax of language 00:09:56.640 --> 00:09:58.013 and the semantics of language. 00:09:58.013 --> 00:10:00.180 And today, we'll take a look at both of those ideas. 00:10:00.180 --> 00:10:02.280 We're going to start by talking about syntax 00:10:02.280 --> 00:10:05.550 and getting a sense for how it is that language is structured, 00:10:05.550 --> 00:10:09.150 and how we can start by coming up with some rules, some ways 00:10:09.150 --> 00:10:12.930 that we can tell our computer, tell our AI what types of things 00:10:12.930 --> 00:10:16.540 are valid sentences, what types of things are not valid sentences. 00:10:16.540 --> 00:10:19.070 And ultimately, we'd like to use that information 00:10:19.070 --> 00:10:21.680 to be able to allow our AI to draw meaningful conclusions, 00:10:21.680 --> 00:10:23.743 to be able to do something with language. 
00:10:23.743 --> 00:10:25.910 And so to do so, we're going to start by introducing 00:10:25.910 --> 00:10:27.830 the notion of formal grammar. 00:10:27.830 --> 00:10:30.320 And what formal grammar is all about is this: formal grammar 00:10:30.320 --> 00:10:34.400 is a system of rules that generate sentences in a language. 00:10:34.400 --> 00:10:38.120 I would like to know what are the valid English sentences-- 00:10:38.120 --> 00:10:39.710 not in terms of what they mean-- 00:10:39.710 --> 00:10:42.590 just in terms of their structure-- their syntactic structure. 00:10:42.590 --> 00:10:45.740 What structures of English are valid, correct sentences? 00:10:45.740 --> 00:10:47.780 What structures of English are not valid? 00:10:47.780 --> 00:10:50.930 And this is going to apply in a very similar way to other natural languages 00:10:50.930 --> 00:10:54.110 as well, where language follows certain types of structures. 00:10:54.110 --> 00:10:56.870 And we intuitively know what these structures mean, 00:10:56.870 --> 00:10:59.840 but it's going to be helpful to try and really formally define 00:10:59.840 --> 00:11:01.980 what the structures mean as well. 00:11:01.980 --> 00:11:04.520 There are a number of different types of formal grammar 00:11:04.520 --> 00:11:07.318 all across what's known as the Chomsky hierarchy of grammars. 00:11:07.318 --> 00:11:09.110 And you may have seen some of these before. 00:11:09.110 --> 00:11:11.780 If you've ever worked with regular expressions before, 00:11:11.780 --> 00:11:14.300 those belong to a class of regular languages. 00:11:14.300 --> 00:11:19.320 They correspond to regular languages, which are a particular type of language. 00:11:19.320 --> 00:11:21.860 But also on this hierarchy is a type of grammar 00:11:21.860 --> 00:11:23.193 known as a context-free grammar. 00:11:23.193 --> 00:11:25.235 And this is the one we're going to spend the most 00:11:25.235 --> 00:11:27.120 time on taking a look at today. 
00:11:27.120 --> 00:11:31.640 And what a context-free grammar is is a way 00:11:31.640 --> 00:11:34.760 of generating sentences in a language via what 00:11:34.760 --> 00:11:39.020 are known as rewriting rules-- replacing one symbol with other symbols. 00:11:39.020 --> 00:11:42.360 And we'll take a look in a moment at just what that means. 00:11:42.360 --> 00:11:45.950 So let's imagine, for example, a simple sentence in English, 00:11:45.950 --> 00:11:48.520 a sentence like, "she saw the city"-- 00:11:48.520 --> 00:11:52.190 a valid, syntactically well-formed English sentence. 00:11:52.190 --> 00:11:55.640 But we'd like some way for our AI to be able to look at the sentence 00:11:55.640 --> 00:12:00.200 and figure out, what is the structure of the sentence? 00:12:00.200 --> 00:12:02.630 If you imagine an AI in a question answering format-- 00:12:02.630 --> 00:12:05.812 if you want to ask the AI a question like, what did she see, 00:12:05.812 --> 00:12:08.270 well, then the AI wants to be able to look at this sentence 00:12:08.270 --> 00:12:13.530 and recognize that what she saw is the city-- to be able to figure that out. 00:12:13.530 --> 00:12:15.770 And it requires some understanding of what 00:12:15.770 --> 00:12:19.760 it is that the structure of this sentence really looks like. 00:12:19.760 --> 00:12:20.960 So where do we begin? 00:12:20.960 --> 00:12:23.410 Each of these words-- she, saw, the, city-- 00:12:23.410 --> 00:12:25.585 we are going to call terminal symbols. 00:12:25.585 --> 00:12:28.460 They're symbols in our language-- where each of these words is just 00:12:28.460 --> 00:12:29.480 a symbol-- 00:12:29.480 --> 00:12:32.470 and this is ultimately what we care about generating. 00:12:32.470 --> 00:12:34.730 We care about generating these words. 00:12:34.730 --> 00:12:37.280 But each of these words we're also going to associate 00:12:37.280 --> 00:12:40.130 with what we're going to call a non-terminal symbol. 
00:12:40.130 --> 00:12:43.460 And these non-terminal symbols initially are going to look kind of like parts 00:12:43.460 --> 00:12:46.260 of speech, if you remember back to like English grammar-- 00:12:46.260 --> 00:12:49.880 where she is a noun, saw is a V for verb, 00:12:49.880 --> 00:12:52.550 the is a D. D stands for determiner. 00:12:52.550 --> 00:12:55.730 These are words like the, and a, and an, for example. 00:12:55.730 --> 00:12:59.550 And then city-- well, city is also a noun, so an N goes there. 00:12:59.550 --> 00:13:00.320 So each of these-- 00:13:00.320 --> 00:13:01.730 N, V, and D-- 00:13:01.730 --> 00:13:04.460 these are what we might call non-terminal symbols. 00:13:04.460 --> 00:13:07.370 They're not actually words in the language. 00:13:07.370 --> 00:13:10.010 She saw the city-- those are the words in the language. 00:13:10.010 --> 00:13:14.210 But we use these non-terminal symbols to generate the terminal symbols, 00:13:14.210 --> 00:13:16.640 the terminal symbols which are like, she saw the city-- 00:13:16.640 --> 00:13:20.000 the words that are actually in a language like English. 00:13:20.000 --> 00:13:24.260 And so in order to translate these non-terminal symbols into terminal 00:13:24.260 --> 00:13:27.422 symbols, we have what are known as rewriting rules, 00:13:27.422 --> 00:13:29.130 and these rules look something like this. 00:13:29.130 --> 00:13:32.570 We have N on the left side of an arrow, and the arrow 00:13:32.570 --> 00:13:35.480 says, if I have an N non-terminal symbol, 00:13:35.480 --> 00:13:39.410 then I can turn it into any of these various different possibilities 00:13:39.410 --> 00:13:42.120 that are separated with a vertical line. 00:13:42.120 --> 00:13:45.480 So a noun could translate into the word she. 00:13:45.480 --> 00:13:49.720 A noun could translate into the word city, or car, or Harry, 00:13:49.720 --> 00:13:50.970 or any number of other things. 
00:13:50.970 --> 00:13:53.810 These are all examples of nouns, for example. 00:13:53.810 --> 00:13:58.490 Meanwhile, a determiner, D, could translate into the, or a, or an. 00:13:58.490 --> 00:14:01.310 V for verb could translate into any of these verbs. 00:14:01.310 --> 00:14:04.430 P for preposition could translate into any of those prepositions-- 00:14:04.430 --> 00:14:06.440 to, on, over, and so forth. 00:14:06.440 --> 00:14:11.420 And then ADJ for adjective can translate into any of these possible adjectives 00:14:11.420 --> 00:14:12.390 as well. 00:14:12.390 --> 00:14:15.650 So these then are rules in our context-free grammar. 00:14:15.650 --> 00:14:18.110 When we are defining what it is that our grammar is, 00:14:18.110 --> 00:14:21.500 what is the structure of the English language or any other language, 00:14:21.500 --> 00:14:24.710 we give it these types of rules saying that a noun could 00:14:24.710 --> 00:14:29.360 be any of these possibilities, a verb could be any of those possibilities. 00:14:29.360 --> 00:14:32.900 But it turns out we can then begin to construct other rules where 00:14:32.900 --> 00:14:37.392 it's not just one non-terminal translating into one terminal symbol. 00:14:37.392 --> 00:14:40.100 We're always going to have one non-terminal on the left-hand side 00:14:40.100 --> 00:14:42.515 of the arrow, but on the right-hand side of the arrow, 00:14:42.515 --> 00:14:43.640 we could have other things. 00:14:43.640 --> 00:14:46.830 We could even have other non-terminal symbols. 00:14:46.830 --> 00:14:48.030 So what do I mean by this? 00:14:48.030 --> 00:14:53.070 Well, we have the idea of nouns-- like she, city, car, Harry, for example-- 00:14:53.070 --> 00:14:55.340 but there are also noun phrases-- 00:14:55.340 --> 00:14:57.760 phrases that work as nouns-- 00:14:57.760 --> 00:15:00.900 that are not just a single word, but multiple words. 
00:15:00.900 --> 00:15:04.400 Like the city is two words that, together, operate 00:15:04.400 --> 00:15:06.140 as what we might call a noun phrase. 00:15:06.140 --> 00:15:08.870 It's multiple words, but they're together operating as a noun. 00:15:08.870 --> 00:15:12.410 Or if you think about a more complex expression, like the big city-- 00:15:12.410 --> 00:15:15.380 three words all operating as a single noun-- 00:15:15.380 --> 00:15:17.200 or the car on the street-- 00:15:17.200 --> 00:15:22.390 multiple words now, but that entire set of words operates kind of like a noun. 00:15:22.390 --> 00:15:25.130 It substitutes as a noun phrase. 00:15:25.130 --> 00:15:27.100 And so to do this, we'll introduce the notion 00:15:27.100 --> 00:15:32.380 of a new non-terminal symbol called NP, which will stand for noun phrase. 00:15:32.380 --> 00:15:36.220 And this rewriting rule says that a noun phrase could be a noun-- 00:15:36.220 --> 00:15:39.250 so something like she is a noun, and therefore, it 00:15:39.250 --> 00:15:40.810 can also be a noun phrase-- 00:15:40.810 --> 00:15:46.360 but a noun phrase could also be a determiner, D, followed by a noun-- 00:15:46.360 --> 00:15:49.315 so two ways we can have a noun phrase in this very simple grammar. 00:15:49.315 --> 00:15:51.940 Of course, the English language is more complex than just this, 00:15:51.940 --> 00:15:57.460 but a noun phrase is either a noun or it is a determiner followed by a noun. 00:15:57.460 --> 00:16:00.130 So for the first example, a noun phrase that is just a noun, 00:16:00.130 --> 00:16:04.150 that would allow us to generate noun phrases like she, 00:16:04.150 --> 00:16:07.960 because a noun phrase is just a noun, and a noun 00:16:07.960 --> 00:16:10.833 could be the word she, for example. 
00:16:10.833 --> 00:16:13.750 Meanwhile, if we wanted to look at one of the examples of these, where 00:16:13.750 --> 00:16:16.750 a noun phrase becomes a determiner and a noun, 00:16:16.750 --> 00:16:18.460 then we get a structure like this. 00:16:18.460 --> 00:16:21.250 And now we're starting to see the structure of language 00:16:21.250 --> 00:16:24.970 emerge from these rules in a syntax tree, as we'll call it, 00:16:24.970 --> 00:16:29.260 this tree-like structure that represents the syntax of our natural language. 00:16:29.260 --> 00:16:31.960 Here, we have a noun phrase, and this noun phrase 00:16:31.960 --> 00:16:36.460 is composed of a determiner and a noun, where the determiner is the word the, 00:16:36.460 --> 00:16:40.310 according to that rule, and the noun is the word city. 00:16:40.310 --> 00:16:43.930 So here then is a noun phrase that consists of multiple words inside 00:16:43.930 --> 00:16:45.130 of the structure. 00:16:45.130 --> 00:16:50.140 And using this idea of taking one symbol and rewriting it using other symbols-- 00:16:50.140 --> 00:16:52.900 that might be terminal symbols, like the and city, 00:16:52.900 --> 00:16:57.670 but might also be non-terminal symbols, like D for determiner or N for noun-- 00:16:57.670 --> 00:17:01.090 then we can begin to construct more and more complex structures. 00:17:01.090 --> 00:17:04.420 In addition to noun phrases, we can also think about verb phrases. 00:17:04.420 --> 00:17:06.740 So what might a verb phrase look like? 00:17:06.740 --> 00:17:09.670 Well, a verb phrase might just be a single verb. 00:17:09.670 --> 00:17:13.660 In a sentence like "I walked," walked is a verb, 00:17:13.660 --> 00:17:17.329 and that is acting as the verb phrase in that sentence. 00:17:17.329 --> 00:17:21.493 But there are also more complex verb phrases that aren't just a single word, 00:17:21.493 --> 00:17:22.660 but that are multiple words. 
00:17:22.660 --> 00:17:25.970 If you think of the sentence like "she saw the city," for example, 00:17:25.970 --> 00:17:29.260 saw the city is really that entire verb phrase. 00:17:29.260 --> 00:17:33.245 It's capturing what it is that she is doing, for example. 00:17:33.245 --> 00:17:35.370 And so our verb phrase might have a rule like this. 00:17:35.370 --> 00:17:38.830 A verb phrase is either just a plain verb 00:17:38.830 --> 00:17:43.090 or it is a verb followed by a noun phrase. 00:17:43.090 --> 00:17:45.940 And we saw before that a noun phrase is either a noun 00:17:45.940 --> 00:17:48.580 or it is a determiner followed by a noun. 00:17:48.580 --> 00:17:50.710 And so a verb phrase might be something simple, 00:17:50.710 --> 00:17:52.960 like a verb phrase that is just a verb. 00:17:52.960 --> 00:17:55.587 And that verb could be the word walked for example. 00:17:55.587 --> 00:17:57.670 But it could also be something more sophisticated, 00:17:57.670 --> 00:18:01.780 something like this now, where we begin to see a larger syntax tree, 00:18:01.780 --> 00:18:04.450 where the way to read the syntax tree is that a verb 00:18:04.450 --> 00:18:07.690 phrase is a verb and a noun phrase, where 00:18:07.690 --> 00:18:09.380 that verb could be something like saw. 00:18:09.380 --> 00:18:12.130 And this is a noun phrase we've seen before, this noun phrase that 00:18:12.130 --> 00:18:17.050 is the city-- a noun phrase composed of the determiner the and the noun 00:18:17.050 --> 00:18:21.068 city all put together to construct this larger verb phrase. 00:18:21.068 --> 00:18:23.110 And then just to give one more example of a rule, 00:18:23.110 --> 00:18:24.652 we could also have a rule like this-- 00:18:24.652 --> 00:18:28.180 sentence S goes to noun phrase and a verb phrase. 00:18:28.180 --> 00:18:30.580 The basic structure of a sentence is that it is 00:18:30.580 --> 00:18:32.680 a noun phrase followed by verb phrase. 
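NOTE: The rewriting rules described so far can be written down directly as data. Here is a minimal sketch in Python (not code from the lecture itself) that encodes this small grammar as a dictionary and randomly expands symbols into sentences; the non-terminal names S, NP, VP, N, D, and V match the rules above, while the `generate` function is a hypothetical helper added just for illustration.

```python
import random

# The lecture's toy context-free grammar as data: each non-terminal maps
# to a list of alternatives, and each alternative is a sequence of symbols.
# Symbols not in the dictionary are terminals (actual words).
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["N"], ["D", "N"]],
    "VP": [["V"], ["V", "NP"]],
    "N":  [["she"], ["city"], ["car"], ["Harry"]],
    "D":  [["the"], ["a"], ["an"]],
    "V":  [["saw"], ["walked"]],
}

def generate(symbol="S"):
    """Expand a symbol into a list of terminal words by picking rules at random."""
    if symbol not in GRAMMAR:          # terminal symbol: emit the word itself
        return [symbol]
    alternative = random.choice(GRAMMAR[symbol])
    words = []
    for sym in alternative:            # recursively expand each symbol in the rule
        words.extend(generate(sym))
    return words

print(" ".join(generate()))            # e.g. "she saw the city"
```

Every sentence this sketch prints is, by construction, derivable from S, which is exactly what it means for a grammar to generate a language.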
00:18:32.680 --> 00:18:35.320 And this is a formal grammar way of expressing the idea 00:18:35.320 --> 00:18:38.445 that you might have learned when you learned English grammar, when you read 00:18:38.445 --> 00:18:42.190 that a sentence is like a subject and a verb, subject and action-- 00:18:42.190 --> 00:18:45.330 something that's happening to a particular noun phrase. 00:18:45.330 --> 00:18:47.650 And so using this structure, we could construct 00:18:47.650 --> 00:18:49.740 a sentence that looks like this. 00:18:49.740 --> 00:18:53.140 A sentence consists of a noun phrase and a verb phrase. 00:18:53.140 --> 00:18:56.080 A noun phrase could just be a noun, like the word she. 00:18:56.080 --> 00:18:58.180 The verb phrase could be a verb and a noun phrase, 00:18:58.180 --> 00:19:00.940 where-- this is something we've seen before-- the verb is saw 00:19:00.940 --> 00:19:03.838 and the noun phrase is the city. 00:19:03.838 --> 00:19:05.380 And so now look what we've done here. 00:19:05.380 --> 00:19:08.160 What we've done is, by defining a set of rules, 00:19:08.160 --> 00:19:11.940 there are algorithms that we can run that take these words-- 00:19:11.940 --> 00:19:15.190 and the CYK algorithm, for example, is one example of this if you want to look 00:19:15.190 --> 00:19:15.880 into that-- 00:19:15.880 --> 00:19:20.200 where you start with a set of terminal symbols, like she saw the city, 00:19:20.200 --> 00:19:22.630 and then using these rules, you're able to figure out, 00:19:22.630 --> 00:19:26.958 how is it that you go from a sentence to she saw the city? 00:19:26.958 --> 00:19:28.750 And it's all through these rewriting rules. 00:19:28.750 --> 00:19:31.310 So the sentence is a noun phrase and a verb phrase. 
00:19:31.310 --> 00:19:34.600 A verb phrase could be a verb and a noun phrase, so on and so forth, 00:19:34.600 --> 00:19:37.000 where you can imagine taking this structure 00:19:37.000 --> 00:19:41.510 and figuring out how it is that you could generate a parse tree-- 00:19:41.510 --> 00:19:46.290 a syntax tree-- for that set of terminal symbols, that set of words. 00:19:46.290 --> 00:19:49.990 And if you tried to do this for a sentence that was not grammatical, 00:19:49.990 --> 00:19:53.830 something like "saw the city she," well, that wouldn't work. 00:19:53.830 --> 00:19:56.320 There'd be no way to take these rules and use 00:19:56.320 --> 00:19:58.720 them to be able to generate that sentence, because it 00:19:58.720 --> 00:20:01.220 is not inside of that language. 00:20:01.220 --> 00:20:03.490 So this sort of model can be very helpful 00:20:03.490 --> 00:20:06.040 if the rules are expressive enough to express 00:20:06.040 --> 00:20:09.400 all the ideas that you might want to express inside of natural language. 00:20:09.400 --> 00:20:12.003 Of course, using just the simple rules we have here, 00:20:12.003 --> 00:20:14.920 there are many sentences that we won't be able to generate-- sentences 00:20:14.920 --> 00:20:18.280 that we might agree are grammatically and syntactically well-formed, 00:20:18.280 --> 00:20:21.450 but that we're not going to be able to construct using these rules. 00:20:21.450 --> 00:20:23.200 And then, in that case, we might just need 00:20:23.200 --> 00:20:28.300 to have some more complex rules in order to deal with those sorts of cases. 00:20:28.300 --> 00:20:30.370 And so this type of approach can be powerful 00:20:30.370 --> 00:20:33.430 if you're dealing with a limited set of rules and words 00:20:33.430 --> 00:20:35.230 that you really care about dealing with. 
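The rewriting idea just described can be sketched in a few lines of pure Python. This is a toy illustration only (it is not the CYK algorithm, and not how NLTK implements parsing); the names GRAMMAR, parse, and match are mine, and the grammar is the small one from the lecture.

```python
# Toy context-free grammar from the lecture:
# S -> NP VP, NP -> D N | N, VP -> V | V NP, plus a few terminal words.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["D", "N"], ["N"]],
    "VP": [["V"], ["V", "NP"]],
    "D":  [["the"], ["a"]],
    "N":  [["she"], ["city"], ["car"]],
    "V":  [["saw"], ["walked"]],
}

def parse(symbol, words):
    """Return a syntax tree if `symbol` can derive `words`, else None."""
    if symbol not in GRAMMAR:                       # terminal: a literal word
        return symbol if list(words) == [symbol] else None
    for rhs in GRAMMAR[symbol]:
        children = match(rhs, list(words))
        if children is not None:
            return [symbol] + children
    return None

def match(rhs, words):
    """Try to split `words` so each symbol in `rhs` derives one piece."""
    if not rhs:
        return [] if not words else None
    first, rest = rhs[0], rhs[1:]
    for i in range(1, len(words) - len(rest) + 1):  # each symbol covers >= 1 word
        head = parse(first, words[:i])
        if head is not None:
            tail = match(rest, words[i:])
            if tail is not None:
                return [head] + tail
    return None

print(parse("S", "she saw the city".split()))
```

Running this prints a nested tree mirroring the syntax tree in the lecture: the sentence splits into a noun phrase (she) and a verb phrase (saw the city), while an ungrammatical sequence like "saw the city she" yields no tree at all.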
00:20:35.230 --> 00:20:37.690 And one way we can actually interact with this in Python 00:20:37.690 --> 00:20:42.100 is by using a Python library called NLTK, short for natural language 00:20:42.100 --> 00:20:44.410 toolkit, which we'll see a couple of times today, 00:20:44.410 --> 00:20:47.410 which has a wide variety of different functions and classes 00:20:47.410 --> 00:20:49.300 that we can take advantage of that are all 00:20:49.300 --> 00:20:51.100 meant to deal with natural language. 00:20:51.100 --> 00:20:54.700 And one such algorithm that it has is the ability to parse 00:20:54.700 --> 00:20:57.670 a context-free grammar, to be able to take some words 00:20:57.670 --> 00:20:59.920 and figure out according to some context-free grammar, 00:20:59.920 --> 00:21:02.892 how would you construct the syntax tree for it? 00:21:02.892 --> 00:21:04.600 So let's go ahead and take a look at NLTK 00:21:04.600 --> 00:21:09.950 now by examining how we might construct some context-free grammars with it. 00:21:09.950 --> 00:21:12.110 So here inside of cfg0-- 00:21:12.110 --> 00:21:14.410 cfg's short for context-free grammar-- 00:21:14.410 --> 00:21:19.230 I have a sample context-free grammar which has rules that we've seen before. 00:21:19.230 --> 00:21:22.330 So sentence goes to noun phrase followed by a verb phrase. 00:21:22.330 --> 00:21:25.900 Noun phrase is either a determiner and a noun or a noun. 00:21:25.900 --> 00:21:29.080 Verb phrase is either a verb or a verb and a noun phrase. 00:21:29.080 --> 00:21:32.020 The order of these things doesn't really matter. 00:21:32.020 --> 00:21:34.480 Determiners could be the word the or the word a. 00:21:34.480 --> 00:21:37.630 A noun could be the word she, city, or car. 00:21:37.630 --> 00:21:42.040 And a verb could be the word saw or it could be the word walked. 
00:21:42.040 --> 00:21:45.100 Now, using NLTK, which I've imported here at the top, 00:21:45.100 --> 00:21:47.800 I'm going to go ahead and parse this grammar 00:21:47.800 --> 00:21:50.823 and save it inside of this variable called parser. 00:21:50.823 --> 00:21:52.990 Next, my program is going to ask the user for input. 00:21:52.990 --> 00:21:55.630 Just type in a sentence, and dot split will just 00:21:55.630 --> 00:21:57.790 split it on all of the spaces, so I end up 00:21:57.790 --> 00:22:00.360 getting each of the individual words. 00:22:00.360 --> 00:22:03.400 We're going to save that inside of this list called sentence. 00:22:03.400 --> 00:22:08.350 And then we'll go ahead and try to parse the sentence, and for each sentence 00:22:08.350 --> 00:22:10.840 we parse, we're going to pretty print it to the screen, 00:22:10.840 --> 00:22:12.327 just so it displays in my terminal. 00:22:12.327 --> 00:22:13.660 And we're also going to draw it. 00:22:13.660 --> 00:22:16.210 It turns out that NLTK has some graphics capacity, 00:22:16.210 --> 00:22:19.632 so we can really visually see what that tree looks like as well. 00:22:19.632 --> 00:22:22.340 And there are multiple different ways a sentence might be parsed, 00:22:22.340 --> 00:22:24.700 which is why we're putting it inside of this for loop. 00:22:24.700 --> 00:22:27.762 And we'll see why that can be helpful in a moment too. 00:22:27.762 --> 00:22:30.220 All right, now that I have that, let's go ahead and try it. 00:22:30.220 --> 00:22:34.840 I'll cd into cfg, and we'll go ahead and run cfg0. 00:22:34.840 --> 00:22:37.450 So it then is going to prompt me to type in a sentence. 00:22:37.450 --> 00:22:39.658 And let me type in a very simple sentence-- something 00:22:39.658 --> 00:22:42.070 like she walked, for example. 00:22:42.070 --> 00:22:43.240 Press Return. 
00:22:43.240 --> 00:22:45.510 So what I get is, on the left-hand side, you 00:22:45.510 --> 00:22:48.902 can see a text-based representation of the syntax tree. 00:22:48.902 --> 00:22:51.610 And on the right side here-- let me go ahead and make it bigger-- 00:22:51.610 --> 00:22:55.240 we see a visual representation of that same syntax tree. 00:22:55.240 --> 00:22:59.960 This is how it is that my computer has now parsed the sentence she walked. 00:22:59.960 --> 00:23:02.980 It's a sentence that consists of a noun phrase and a verb phrase, 00:23:02.980 --> 00:23:06.790 where each phrase is just a single noun or verb, she and then walked-- 00:23:06.790 --> 00:23:09.100 same type of structure we've seen before, 00:23:09.100 --> 00:23:11.410 but this now is our computer able to understand 00:23:11.410 --> 00:23:13.990 the structure of the sentence, to be able to get 00:23:13.990 --> 00:23:17.920 some sort of structural understanding of how it is that parts of the sentence 00:23:17.920 --> 00:23:19.660 relate to each other. 00:23:19.660 --> 00:23:21.460 Let me now give it another sentence. 00:23:21.460 --> 00:23:25.180 I could try something like she saw the city, for example-- 00:23:25.180 --> 00:23:27.350 the words we were dealing with a moment ago. 00:23:27.350 --> 00:23:31.050 And then we end up getting this syntax tree out of it-- 00:23:31.050 --> 00:23:34.170 again, a sentence that has a noun phrase and a verb phrase. 00:23:34.170 --> 00:23:35.800 The noun phrase is fairly simple. 00:23:35.800 --> 00:23:36.960 It's just she. 00:23:36.960 --> 00:23:38.460 But the verb phrase is more complex. 00:23:38.460 --> 00:23:42.390 It is now saw the city, for example. 00:23:42.390 --> 00:23:44.790 Let's do one more with this grammar. 00:23:44.790 --> 00:23:47.343 Let's do something like she saw a car. 00:23:47.343 --> 00:23:49.010 And that is going to look very similar-- 00:23:49.010 --> 00:23:50.328 that we also get she. 
00:23:50.328 --> 00:23:51.870 But our verb phrase is now different. 00:23:51.870 --> 00:23:55.220 It's saw a car, because there are multiple possible determiners 00:23:55.220 --> 00:23:57.307 in our language and multiple possible nouns. 00:23:57.307 --> 00:23:59.390 I haven't given this grammar that many words, 00:23:59.390 --> 00:24:01.790 but if I gave it a larger vocabulary, it would then 00:24:01.790 --> 00:24:06.360 be able to understand more and more different types of sentences. 00:24:06.360 --> 00:24:09.590 And just to give you a sense of some added complexity we could add here, 00:24:09.590 --> 00:24:12.568 the more complex our grammar, the more rules we add, 00:24:12.568 --> 00:24:14.360 the more different types of sentences we'll 00:24:14.360 --> 00:24:15.860 then have the ability to generate. 00:24:15.860 --> 00:24:18.410 So let's take a look at cfg1, for example, 00:24:18.410 --> 00:24:21.590 where I've added a whole number of other different types of rules. 00:24:21.590 --> 00:24:25.970 I've added adjective phrases, where we can have multiple adjectives inside 00:24:25.970 --> 00:24:27.590 of a noun phrase as well. 00:24:27.590 --> 00:24:31.310 So a noun phrase could be an adjective phrase followed by a noun phrase. 00:24:31.310 --> 00:24:33.650 If I wanted to say something like the big city, 00:24:33.650 --> 00:24:37.250 that's an adjective phrase followed by a noun phrase. 00:24:37.250 --> 00:24:40.740 Or we could also have a noun and a prepositional phrase-- 00:24:40.740 --> 00:24:43.250 so the car on the street, for example. 00:24:43.250 --> 00:24:46.100 On the street is a prepositional phrase, and we 00:24:46.100 --> 00:24:50.060 might want to combine those two ideas together, because the car on the street 00:24:50.060 --> 00:24:53.333 can still operate as something kind of like a noun phrase as well. 
00:24:53.333 --> 00:24:56.000 So no need to understand all of these rules in too much detail-- 00:24:56.000 --> 00:24:59.240 it starts to get into the nature of English grammar-- 00:24:59.240 --> 00:25:04.980 but now we have a more complex way of understanding these types of sentences. 00:25:04.980 --> 00:25:07.190 So if I run Python cfg1-- 00:25:07.190 --> 00:25:13.130 and I can try typing something like she saw the wide street, for example-- 00:25:13.130 --> 00:25:14.840 a more complex sentence. 00:25:14.840 --> 00:25:18.990 And if we make that larger, you can see what this sentence looks like. 00:25:18.990 --> 00:25:21.700 I'll go ahead and shrink it a little bit. 00:25:21.700 --> 00:25:26.100 So now we have a sentence like this-- she saw the wide street. 00:25:26.100 --> 00:25:28.830 The wide street is one entire noun phrase, 00:25:28.830 --> 00:25:31.470 saw the wide street is an entire verb phrase, 00:25:31.470 --> 00:25:35.830 and she saw the wide street ends up forming that entire sentence. 00:25:35.830 --> 00:25:40.150 So let's take a look at one more example to introduce this notion of ambiguity. 00:25:40.150 --> 00:25:42.060 So I can run Python cfg1. 00:25:42.060 --> 00:25:48.540 Let me type a sentence like she saw a dog with binoculars. 00:25:48.540 --> 00:25:52.860 So there's a sentence, and here now is one possible syntax tree 00:25:52.860 --> 00:25:54.510 to represent this idea-- 00:25:54.510 --> 00:25:59.190 she saw, the noun phrase a dog, and then the prepositional phrase 00:25:59.190 --> 00:26:00.390 with binoculars. 00:26:00.390 --> 00:26:06.000 And the way to interpret the sentence is that what it is that she saw was a dog. 00:26:06.000 --> 00:26:07.980 And how did she do the seeing? 00:26:07.980 --> 00:26:10.680 She did the seeing with binoculars. 00:26:10.680 --> 00:26:13.080 And so this is one possible way to interpret this. 00:26:13.080 --> 00:26:14.730 She was using binoculars. 
00:26:14.730 --> 00:26:18.170 Using those binoculars, she saw a dog. 00:26:18.170 --> 00:26:21.000 But another possible way to parse that sentence 00:26:21.000 --> 00:26:25.020 would be with this tree over here, where you have something 00:26:25.020 --> 00:26:31.000 like she saw a dog with binoculars, where a dog with binoculars 00:26:31.000 --> 00:26:33.340 forms an entire noun phrase of its own-- 00:26:33.340 --> 00:26:37.000 same words in the same order, but a different grammatical structure, 00:26:37.000 --> 00:26:41.350 where now we have a dog with binoculars all inside of this noun phrase, 00:26:41.350 --> 00:26:42.700 meaning what did she see? 00:26:42.700 --> 00:26:44.920 What she saw was a dog, and that dog happened 00:26:44.920 --> 00:26:49.210 to have binoculars with it-- so different ways to parse the sentence-- 00:26:49.210 --> 00:26:53.700 different structures for the sentence-- even given the same possible sequence of words. 00:26:53.700 --> 00:26:56.320 And NLTK's algorithm-- this particular algorithm-- 00:26:56.320 --> 00:26:58.150 has the ability to find all of these, to be 00:26:58.150 --> 00:27:00.610 able to understand the different ways that you might 00:27:00.610 --> 00:27:05.080 be able to parse a sentence and be able to extract some sort of useful meaning 00:27:05.080 --> 00:27:07.900 out of that sentence as well. 00:27:07.900 --> 00:27:11.650 So that then is a brief look at what we can do-- 00:27:11.650 --> 00:27:16.300 at getting at the structure of language, at using these context-free grammar 00:27:16.300 --> 00:27:19.270 rules to be able to describe the structure of language. 00:27:19.270 --> 00:27:22.150 But what we might also care about is understanding 00:27:22.150 --> 00:27:24.700 how it is that these sequences of words are 00:27:24.700 --> 00:27:29.080 likely to relate to each other in terms of the actual words themselves. 
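The two readings of "she saw a dog with binoculars" can be reproduced with a small sketch that enumerates every parse instead of stopping at the first one. The grammar below (with prepositional-phrase rules added) and the helper names parses and expansions are mine, a toy stand-in for what NLTK's chart parser reports for a richer grammar.

```python
# Toy grammar extended with prepositional phrases (PP), so a PP can
# attach either to the verb phrase or inside the noun phrase.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["D", "N"], ["N"], ["D", "N", "PP"], ["N", "PP"]],
    "VP": [["V"], ["V", "NP"], ["V", "NP", "PP"]],
    "PP": [["P", "NP"]],
    "D":  [["a"], ["the"]],
    "N":  [["she"], ["dog"], ["binoculars"]],
    "V":  [["saw"]],
    "P":  [["with"]],
}

def parses(symbol, words):
    """All syntax trees by which `symbol` derives `words`."""
    words = list(words)
    if symbol not in GRAMMAR:                       # terminal: a literal word
        return [symbol] if words == [symbol] else []
    trees = []
    for rhs in GRAMMAR[symbol]:
        for children in expansions(rhs, words):
            trees.append([symbol] + children)
    return trees

def expansions(rhs, words):
    """All ways to split `words` among the symbols of `rhs`."""
    if not rhs:
        return [[]] if not words else []
    first, rest = rhs[0], rhs[1:]
    results = []
    for i in range(1, len(words) - len(rest) + 1):  # each symbol covers >= 1 word
        for head in parses(first, words[:i]):
            for tail in expansions(rest, words[i:]):
                results.append([head] + tail)
    return results

trees = parses("S", "she saw a dog with binoculars".split())
print(len(trees))   # one tree attaches the PP to the verb, one to the noun phrase
```

The ambiguous sentence yields exactly two trees, while an unambiguous one like "she saw a dog" yields one.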
00:27:29.080 --> 00:27:33.100 The grammar that we saw before could allow us to generate a sentence like, 00:27:33.100 --> 00:27:37.930 I ate a banana, for example, where I is the noun phrase and ate a banana 00:27:37.930 --> 00:27:39.190 is a verb phrase. 00:27:39.190 --> 00:27:41.800 But it would also allow for sentences like, I 00:27:41.800 --> 00:27:46.180 ate a blue car, for example, which is also syntactically well-formed 00:27:46.180 --> 00:27:50.830 according to the rules, but is probably a sentence that a person is 00:27:50.830 --> 00:27:51.640 less likely to speak. 00:27:51.640 --> 00:27:54.550 And we might want for our AI to be able to encapsulate 00:27:54.550 --> 00:28:00.140 the idea that certain sequences of words are more or less likely than others. 00:28:00.140 --> 00:28:03.880 So to deal with that, we'll introduce the notion of an n-gram, 00:28:03.880 --> 00:28:06.910 and an n-gram, more generally, just refers to some sequence 00:28:06.910 --> 00:28:09.880 of n items inside of our text. 00:28:09.880 --> 00:28:12.350 And those items might take various different forms. 00:28:12.350 --> 00:28:15.220 We can have character n-grams, which are just a contiguous 00:28:15.220 --> 00:28:18.520 sequence of n characters-- so three characters in a row, 00:28:18.520 --> 00:28:20.770 for example, or four characters in a row. 00:28:20.770 --> 00:28:23.500 We can also have word n-grams, which are a contiguous 00:28:23.500 --> 00:28:28.840 sequence of n words in a row from a particular sample of text. 00:28:28.840 --> 00:28:30.760 And these end up proving quite useful, and you 00:28:30.760 --> 00:28:34.700 can choose our n to decide how long our sequence is going to be. 00:28:34.700 --> 00:28:39.170 So when n is 1, we're just looking at a single word or a single character. 00:28:39.170 --> 00:28:42.760 And that is what we might call a unigram, just one item. 
00:28:42.760 --> 00:28:45.160 If we're looking at two characters or two words, 00:28:45.160 --> 00:28:47.590 that's generally called a bigram-- so an n-gram 00:28:47.590 --> 00:28:51.205 where n is equal to 2, looking at two words that are consecutive. 00:28:51.205 --> 00:28:53.080 And then, if there are three items, you might 00:28:53.080 --> 00:28:56.200 imagine we'll often call those trigrams-- so three characters 00:28:56.200 --> 00:29:00.770 in a row or three words that happen to be in a contiguous sequence. 00:29:00.770 --> 00:29:04.000 And so if we took a sentence, for example-- 00:29:04.000 --> 00:29:06.367 here's a sentence from, again, Sherlock Holmes-- 00:29:06.367 --> 00:29:08.200 "how often have I said to you that, when you 00:29:08.200 --> 00:29:10.540 have eliminated the impossible, whatever remains, 00:29:10.540 --> 00:29:13.300 however improbable, must be the truth." 00:29:13.300 --> 00:29:16.090 What are the trigrams that we can extract from the sentence? 00:29:16.090 --> 00:29:18.830 If we're looking at sequences of three words, 00:29:18.830 --> 00:29:21.280 well, the first trigram would be how often 00:29:21.280 --> 00:29:23.890 have-- just a sequence of three words. 00:29:23.890 --> 00:29:25.960 And then we can look at the next trigram, 00:29:25.960 --> 00:29:29.200 often have I. The next trigram is have I said. 00:29:29.200 --> 00:29:32.320 Then I said to, said to you, to you that, for example-- 00:29:32.320 --> 00:29:36.700 those are all trigrams of words, sequences of three contiguous words 00:29:36.700 --> 00:29:38.410 that show up in the text. 
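The sliding window just described is a one-liner in Python. This is a minimal sketch (NLTK offers the same behavior as nltk.ngrams); the function name ngrams is mine, and punctuation is dropped from the example sentence for simplicity.

```python
def ngrams(tokens, n):
    """All contiguous sequences of n tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# The Sherlock Holmes sentence from the lecture, with punctuation removed.
words = ("how often have I said to you that when you have eliminated "
         "the impossible whatever remains however improbable "
         "must be the truth").split()

trigrams = ngrams(words, 3)
print(trigrams[:3])
```

The first few trigrams printed are exactly those walked through above: ('how', 'often', 'have'), ('often', 'have', 'I'), ('have', 'I', 'said').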
00:29:38.410 --> 00:29:43.120 And extracting those bigrams and trigrams, or n-grams more generally, 00:29:43.120 --> 00:29:45.820 turns out to be quite helpful, because often, 00:29:45.820 --> 00:29:48.113 when we're dealing with analyzing a lot of text, 00:29:48.113 --> 00:29:50.530 it's not going to be particularly meaningful for us to try 00:29:50.530 --> 00:29:53.990 and analyze the entire text at one time. 00:29:53.990 --> 00:29:57.670 But instead, we want to segment that text into pieces that we 00:29:57.670 --> 00:29:59.650 can begin to do some analysis of-- 00:29:59.650 --> 00:30:03.070 that our AI might never have seen this entire sentence before, 00:30:03.070 --> 00:30:07.810 but it's probably seen the trigram to you that before, 00:30:07.810 --> 00:30:11.710 because to you that is something that might have come up in other documents 00:30:11.710 --> 00:30:13.240 that our AI has seen before. 00:30:13.240 --> 00:30:16.900 And therefore, it knows a little bit about that particular sequence 00:30:16.900 --> 00:30:20.890 of three words in a row-- or something like have I said, 00:30:20.890 --> 00:30:24.820 another example of another sequence of three words that's probably 00:30:24.820 --> 00:30:28.880 quite popular, in terms of where you see it inside the English language. 00:30:28.880 --> 00:30:32.433 So we'd like some way to be able to extract these sorts of n-grams. 00:30:32.433 --> 00:30:33.350 And how do we do that? 00:30:33.350 --> 00:30:35.770 How do we extract sequences of three words? 00:30:35.770 --> 00:30:39.490 Well, we need to take our input and somehow separate it 00:30:39.490 --> 00:30:41.810 into all of the individual words. 00:30:41.810 --> 00:30:45.010 And this is a process generally known as tokenization, 00:30:45.010 --> 00:30:48.250 the task of splitting up some sequence into distinct pieces, 00:30:48.250 --> 00:30:50.440 where we call those pieces tokens. 
00:30:50.440 --> 00:30:53.480 Most commonly, this refers to something like word tokenization. 00:30:53.480 --> 00:30:55.810 I have some sequence of text and I want to split it up 00:30:55.810 --> 00:30:58.810 into all of the words that show up in that text. 00:30:58.810 --> 00:31:01.240 But it might also come up in the context of something 00:31:01.240 --> 00:31:02.680 like sentence tokenization. 00:31:02.680 --> 00:31:05.950 I have a long sequence of text and I'd like to split it up 00:31:05.950 --> 00:31:08.050 into sentences, for example. 00:31:08.050 --> 00:31:11.260 And so how might word tokenization work, the task of splitting up 00:31:11.260 --> 00:31:13.660 our sequence of characters into words? 00:31:13.660 --> 00:31:15.640 Well, we've also already seen this idea. 00:31:15.640 --> 00:31:18.610 We've seen that, in word tokenization just a moment ago, I 00:31:18.610 --> 00:31:22.660 took an input sequence and I just called Python's split method on it, where 00:31:22.660 --> 00:31:25.360 the split method took that sequence of words 00:31:25.360 --> 00:31:29.880 and just separated it based on where the spaces showed up in that word. 00:31:29.880 --> 00:31:33.640 And so if I had a sentence like, whatever remains, however improbable, 00:31:33.640 --> 00:31:37.620 must be the truth, how would I tokenize this? 00:31:37.620 --> 00:31:41.460 Well, the naive approach is just to say, anytime you see a space, 00:31:41.460 --> 00:31:42.600 go ahead and split it up. 00:31:42.600 --> 00:31:46.800 We're going to split up this particular string just by looking for spaces. 00:31:46.800 --> 00:31:49.830 And what we get when we do that is a sentence like this-- 00:31:49.830 --> 00:31:53.660 whatever remains, however improbable, must be the truth. 00:31:53.660 --> 00:31:56.160 But what you'll notice here is that, if we just split things 00:31:56.160 --> 00:32:00.930 up in terms of where the spaces are, we end up keeping the punctuation around. 
00:32:00.930 --> 00:32:02.960 There's a comma after the word remains. 00:32:02.960 --> 00:32:06.030 There's a comma after improbable, a period after truth. 00:32:06.030 --> 00:32:08.160 And this poses a little bit of a challenge, when 00:32:08.160 --> 00:32:11.820 we think about trying to tokenize things into individual words, 00:32:11.820 --> 00:32:15.150 because if you're comparing words to each other, this word 00:32:15.150 --> 00:32:16.712 truth with a period after it-- 00:32:16.712 --> 00:32:18.420 if you just string compare it, it's going 00:32:18.420 --> 00:32:21.270 to be different from the word truth without a period after it. 00:32:21.270 --> 00:32:23.810 And so this punctuation can sometimes pose a problem for us, 00:32:23.810 --> 00:32:27.060 and so we might want some way of dealing with it-- either treating punctuation 00:32:27.060 --> 00:32:30.990 as a separate token altogether or maybe removing that punctuation entirely 00:32:30.990 --> 00:32:32.920 from our sequence as well. 00:32:32.920 --> 00:32:35.020 So that might be something we want to do. 00:32:35.020 --> 00:32:38.010 But there are other cases where it becomes a little bit less clear. 00:32:38.010 --> 00:32:40.680 If I said something like, just before 9 o'clock, 00:32:40.680 --> 00:32:43.110 Sherlock Holmes stepped briskly into the room, 00:32:43.110 --> 00:32:46.167 well, this apostrophe after 9 o'clock-- 00:32:46.167 --> 00:32:48.750 after the O in 9 o'clock-- is that something we should remove? 00:32:48.750 --> 00:32:52.080 Should we split based on that as well, into O and clock? 00:32:52.080 --> 00:32:54.090 There are some interesting questions there too. 00:32:54.090 --> 00:32:57.360 And it gets even trickier if you begin to think about hyphenated words-- 00:32:57.360 --> 00:33:00.650 something like this, where we have a whole bunch of words 00:33:00.650 --> 00:33:03.840 that are hyphenated and then you need to make a judgment call. 
00:33:03.840 --> 00:33:06.180 Is that a place where you're going to split things apart 00:33:06.180 --> 00:33:09.840 into individual words, or are you going to consider frock-coat, and well-cut, 00:33:09.840 --> 00:33:13.300 and pearl-grey to be individual words of their own? 00:33:13.300 --> 00:33:16.530 And so those tend to pose challenges that we need to somehow deal with 00:33:16.530 --> 00:33:19.890 and something we need to decide as we go about trying 00:33:19.890 --> 00:33:21.790 to perform this kind of analysis. 00:33:21.790 --> 00:33:25.950 Similar challenges arise when it comes to the world of sentence tokenization. 00:33:25.950 --> 00:33:29.410 Imagine this sequence of sentences, for example. 00:33:29.410 --> 00:33:31.927 If you take a look at this particular sequence of sentences, 00:33:31.927 --> 00:33:35.010 you could probably imagine you could extract the sentences pretty readily. 00:33:35.010 --> 00:33:38.060 Here is one sentence and here is a second sentence, 00:33:38.060 --> 00:33:43.060 so we have two different sentences inside of this particular passage. 00:33:43.060 --> 00:33:46.260 And the distinguishing feature seems to be the period-- 00:33:46.260 --> 00:33:48.963 that a period separates one sentence from another. 00:33:48.963 --> 00:33:50.880 And maybe there are other types of punctuation 00:33:50.880 --> 00:33:52.830 you might include here as well-- 00:33:52.830 --> 00:33:55.740 an exclamation point, for example, or a question mark. 00:33:55.740 --> 00:33:58.080 But those are the types of punctuation that we know 00:33:58.080 --> 00:34:00.750 tend to come at the end of sentences. 00:34:00.750 --> 00:34:04.410 But it gets trickier again if you look at a sentence like this-- not just 00:34:04.410 --> 00:34:07.140 talking to Sherlock, but instead of talking to Sherlock, 00:34:07.140 --> 00:34:09.449 talking to Mr. Holmes. 00:34:09.449 --> 00:34:11.313 Well now, we have a period at the end of Mr. 
00:34:11.313 --> 00:34:13.230 And so if you were just separating on periods, 00:34:13.230 --> 00:34:15.570 you might imagine this would be a sentence, 00:34:15.570 --> 00:34:17.760 and then just Holmes would be a sentence, 00:34:17.760 --> 00:34:19.800 and then we'd have a third sentence down below. 00:34:19.800 --> 00:34:23.159 Things do get a little bit trickier as you start 00:34:23.159 --> 00:34:25.050 to imagine these sorts of situations. 00:34:25.050 --> 00:34:27.690 And dialogue too starts to make this trickier as well-- 00:34:27.690 --> 00:34:31.860 that if you have these sorts of lines that are inside of something that-- 00:34:31.860 --> 00:34:33.150 he said, for example-- 00:34:33.150 --> 00:34:35.639 that he said this particular sequence of words 00:34:35.639 --> 00:34:37.469 and then this particular sequence of words. 00:34:37.469 --> 00:34:40.170 There are interesting challenges that arise there too, 00:34:40.170 --> 00:34:42.389 in terms of how it is that we take the sentence 00:34:42.389 --> 00:34:46.268 and split it up into individual sentences as well. 00:34:46.268 --> 00:34:48.810 And these are just things that our algorithm needs to decide. 00:34:48.810 --> 00:34:51.370 In practice, there are usually some heuristics that we can use. 00:34:51.370 --> 00:34:53.610 We know there are certain occurrences of periods, 00:34:53.610 --> 00:34:56.580 like the period after Mr., or in other examples where 00:34:56.580 --> 00:34:59.010 we know that is not the beginning of a new sentence, 00:34:59.010 --> 00:35:01.770 and so we can encode those rules into our AI 00:35:01.770 --> 00:35:04.680 to allow it to be able to do this tokenization the way 00:35:04.680 --> 00:35:06.060 that we want it to. 
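Both tokenization tasks, and the heuristics just mentioned, can be sketched in a few lines. These are hedged toy versions, not the real thing (NLTK's word and sentence tokenizers are far more thorough); the tiny abbreviation list here is an illustrative stand-in I made up.

```python
import re

def word_tokenize(text):
    """Split into words, keeping punctuation as separate tokens.
    Apostrophes stay inside words, so o'clock remains one token."""
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9]", text)

# A tiny, illustrative list of abbreviations whose period does not end a sentence.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "st."}

def sentence_tokenize(text):
    """Split on sentence-final ., !, or ? -- unless the token is a known
    abbreviation like Mr., which should not trigger a split."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:                      # any trailing words without final punctuation
        sentences.append(" ".join(current))
    return sentences

print(word_tokenize("whatever remains, however improbable, must be the truth."))
print(sentence_tokenize("He went to see Mr. Holmes. It was urgent."))
```

The word tokenizer separates the commas and final period into their own tokens, and the sentence tokenizer keeps "Mr. Holmes" inside one sentence rather than splitting after "Mr."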
00:35:06.060 --> 00:35:09.960 So once we have this ability to tokenize a particular passage-- 00:35:09.960 --> 00:35:12.930 take the passage, split it up into individual words-- 00:35:12.930 --> 00:35:17.110 from there, we can begin to extract what the n-grams actually are. 00:35:17.110 --> 00:35:20.190 So we can actually take a look at this by going 00:35:20.190 --> 00:35:23.250 into a Python program that will serve the purpose of extracting 00:35:23.250 --> 00:35:24.630 these n-grams. 00:35:24.630 --> 00:35:27.510 And again, we can use NLTK, the Natural Language Toolkit, in order 00:35:27.510 --> 00:35:28.720 to help us here. 00:35:28.720 --> 00:35:33.540 So I'll go ahead and go into ngrams and we'll take a look at ngrams.py. 00:35:33.540 --> 00:35:36.280 And what we have here is we are going to take 00:35:36.280 --> 00:35:39.190 some corpus of text, just some sequence of documents, 00:35:39.190 --> 00:35:43.960 and use all those documents and extract what the most popular n-grams happen 00:35:43.960 --> 00:35:44.800 to be. 00:35:44.800 --> 00:35:48.490 So in order to do so, we're going to go ahead and load data from a directory 00:35:48.490 --> 00:35:50.510 that we specify in the command line argument. 00:35:50.510 --> 00:35:53.170 We'll also take in a number n as a command line argument 00:35:53.170 --> 00:35:55.390 as well, in terms of what our number should be, 00:35:55.390 --> 00:36:00.480 in terms of how many words we're going to look at in sequence. 00:36:00.480 --> 00:36:05.330 Then we're going to go ahead and just count up all of the nltk.ngrams. 00:36:05.330 --> 00:36:09.170 So we're going to look at all of the n-grams across this entire corpus 00:36:09.170 --> 00:36:11.600 and save it inside this variable ngrams. 00:36:11.600 --> 00:36:14.090 And then we're going to look at the most common ones 00:36:14.090 --> 00:36:15.423 and go ahead and print them out. 
00:36:15.423 --> 00:36:18.020 And so in order to do so, I'm not only using NLTK-- 00:36:18.020 --> 00:36:21.290 I'm also using Counter, which is built into Python as well, where I can just 00:36:21.290 --> 00:36:25.800 count up how many times these various different n-grams appear. 00:36:25.800 --> 00:36:27.480 So we'll go ahead and show that. 00:36:27.480 --> 00:36:31.500 We'll go into ngrams, and I'll say something like python ngrams-- 00:36:31.500 --> 00:36:34.020 and let's just first look for the unigrams, sequences 00:36:34.020 --> 00:36:37.000 of one word inside of a corpus. 00:36:37.000 --> 00:36:39.270 And the corpus that I've prepared is I have 00:36:39.270 --> 00:36:42.720 all of the-- or some of these stories from Sherlock Holmes 00:36:42.720 --> 00:36:47.140 all here, where each one is just one of the Sherlock Holmes stories. 00:36:47.140 --> 00:36:50.010 And so I have a whole bunch of text here inside of this corpus, 00:36:50.010 --> 00:36:54.270 and I'll go ahead and provide that corpus as a command line argument. 00:36:54.270 --> 00:36:55.980 And now what my program is going to do is 00:36:55.980 --> 00:36:59.000 it's going to load all of the Sherlock Holmes stories into memory-- 00:36:59.000 --> 00:37:01.500 or all the ones that I've provided in this corpus at least-- 00:37:01.500 --> 00:37:04.200 and it's just going to look for the most popular unigrams, 00:37:04.200 --> 00:37:07.050 the most popular sequences of one word. 00:37:07.050 --> 00:37:12.060 And it seems the most popular one is just the word the, used 9,700 times; 00:37:12.060 --> 00:37:15.930 followed by I, used 5,000 times; and, used about 5,000 times-- 00:37:15.930 --> 00:37:18.370 the kinds of words you might expect. 00:37:18.370 --> 00:37:24.900 So now let's go ahead and check for bigrams, for example, ngrams 2, holmes. 
00:37:24.900 --> 00:37:28.740 All right, again, sequences of two words now that appear multiple times-- 00:37:28.740 --> 00:37:32.840 of the, in the, it was, to the, it is, I have-- so on and so forth. 00:37:32.840 --> 00:37:34.590 These are the types of bigrams that happen 00:37:34.590 --> 00:37:37.590 to come up quite often inside this corpus, inside of the Sherlock 00:37:37.590 --> 00:37:38.400 Holmes stories. 00:37:38.400 --> 00:37:41.060 And it probably is true across other corpora as well, 00:37:41.060 --> 00:37:43.472 but we could only find out if we actually tested it. 00:37:43.472 --> 00:37:45.180 And now, just for good measure, let's try 00:37:45.180 --> 00:37:50.120 one more-- maybe try three, looking now for trigrams that happen to show up. 00:37:50.120 --> 00:37:54.570 And now we get it was the, one of the, I think that, out of the. 00:37:54.570 --> 00:37:56.850 These are sequences of three words now that 00:37:56.850 --> 00:38:00.900 happen to come up multiple times across this particular corpus. 00:38:00.900 --> 00:38:02.970 So what are the potential use cases here? 00:38:02.970 --> 00:38:04.440 Now we have some sort of data. 00:38:04.440 --> 00:38:07.890 We have data about how often particular sequences of words 00:38:07.890 --> 00:38:11.010 show up in this particular order, and using that, 00:38:11.010 --> 00:38:13.410 we can begin to do some sort of predictions. 00:38:13.410 --> 00:38:18.090 We might be able to say that, if you see the words it was, 00:38:18.090 --> 00:38:19.950 there's a reasonable chance the word that 00:38:19.950 --> 00:38:22.130 comes after it should be the word a. 00:38:22.130 --> 00:38:26.340 And if I see the words one of, it's reasonable to imagine 00:38:26.340 --> 00:38:29.190 that the next word might be the word the, for example, 00:38:29.190 --> 00:38:32.640 because we have this data about trigrams, sequences of three words 00:38:32.640 --> 00:38:33.900 and how often they come up. 
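The prediction idea can be sketched directly from trigram counts: given two words, tally every third word that has followed them. A minimal sketch, using Python's built-in Counter; the tiny made-up corpus and the function name next_word_distribution are mine.

```python
from collections import Counter

# A tiny made-up corpus, tokenized into words.
tokens = "it was the best of times it was the worst of times it was a dream".split()

# Count every trigram (sequence of three consecutive words).
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))

def next_word_distribution(w1, w2):
    """How often each third word follows the bigram (w1, w2)."""
    follows = Counter()
    for (a, b, c), n in trigram_counts.items():
        if (a, b) == (w1, w2):
            follows[c] += n
    return follows

print(next_word_distribution("it", "was"))
```

In this toy corpus, "it was" is followed by "the" twice and by "a" once, so "the" would be the more likely prediction for the third word.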
00:38:33.900 --> 00:38:36.150 And now, based on two words, you might be 00:38:36.150 --> 00:38:40.110 able to predict what the third word happens to be. 00:38:40.110 --> 00:38:43.650 And one model we can use for that is a model we've actually seen before. 00:38:43.650 --> 00:38:45.280 It's the Markov model. 00:38:45.280 --> 00:38:47.100 Recall again that the Markov model really 00:38:47.100 --> 00:38:50.010 just refers to some sequence of events that happen one time 00:38:50.010 --> 00:38:54.150 step after another, where every unit has some ability 00:38:54.150 --> 00:38:57.150 to predict what the next unit is going to be-- 00:38:57.150 --> 00:39:00.330 or maybe the past two units predict what the next unit is going to be, 00:39:00.330 --> 00:39:03.270 or the past three predict what the next one is going to be. 00:39:03.270 --> 00:39:05.490 And we can use a Markov model and apply it 00:39:05.490 --> 00:39:08.100 to language for a very naive and simple approach 00:39:08.100 --> 00:39:11.340 at trying to generate natural language, at getting our AI 00:39:11.340 --> 00:39:14.340 to be able to speak English-like text. 00:39:14.340 --> 00:39:18.360 And the way it's going to work is we're going to say something like, come up 00:39:18.360 --> 00:39:20.280 with some probability distribution. 00:39:20.280 --> 00:39:23.070 Given these two words, what is the probability 00:39:23.070 --> 00:39:25.830 distribution over what the third word could possibly 00:39:25.830 --> 00:39:27.240 be based on all the data? 00:39:27.240 --> 00:39:30.660 If you see it was, what are the possible third words we might have? 00:39:30.660 --> 00:39:32.190 How often do they come up? 00:39:32.190 --> 00:39:35.070 And using that information, we can try and construct 00:39:35.070 --> 00:39:37.450 what we expect the third word to be. 
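The probability distribution described here, given two words a distribution over the third, can be estimated directly from trigram counts. A minimal sketch on a toy corpus (a real run would load a large corpus like the lecture's):

```python
# Estimate P(third word | previous two words) from trigram counts.
# The toy corpus below is illustrative; real use would load a full text.
from collections import Counter

def third_word_distribution(text, w1, w2):
    words = text.lower().split()
    trigrams = zip(words, words[1:], words[2:])
    # Count only the trigrams whose first two words match (w1, w2).
    counts = Counter(c for a, b, c in trigrams if (a, b) == (w1, w2))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

corpus = "it was a dark night and it was a cold night and it was the end"
print(third_word_distribution(corpus, "it", "was"))  # e.g. {'a': 2/3, 'the': 1/3}
```

With real data, this is exactly the kind of estimate that says "after it was, the word a is quite likely."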
00:39:37.450 --> 00:39:39.270 And if you keep doing this, the effect is 00:39:39.270 --> 00:39:42.030 that our Markov model can effectively start 00:39:42.030 --> 00:39:45.330 to generate text-- can be able to generate text that 00:39:45.330 --> 00:39:48.330 was not in the original corpus, but that sounds 00:39:48.330 --> 00:39:49.770 kind of like the original corpus. 00:39:49.770 --> 00:39:54.130 It's using the same sorts of rules that the original corpus was using. 00:39:54.130 --> 00:39:56.370 So let's take a look at an example of that 00:39:56.370 --> 00:40:01.740 as well, where here now, I have another corpus that I have here, 00:40:01.740 --> 00:40:04.990 and it is the corpus of all of the works of William Shakespeare. 00:40:04.990 --> 00:40:09.900 So I've got a whole bunch of stories from Shakespeare, and all of them 00:40:09.900 --> 00:40:12.610 are just inside of this big text file. 00:40:12.610 --> 00:40:16.590 And so what I might like to do is look at what all of the n-grams are-- 00:40:16.590 --> 00:40:20.400 maybe look at all the trigrams inside of shakespeare.txt-- 00:40:20.400 --> 00:40:23.040 and figure out, given two words, can I predict 00:40:23.040 --> 00:40:24.548 what the third word is likely to be? 00:40:24.548 --> 00:40:26.340 And then just keep repeating this process-- 00:40:26.340 --> 00:40:27.240 I have two words-- 00:40:27.240 --> 00:40:29.400 predict the third word; then, from the second and third word, 00:40:29.400 --> 00:40:31.900 predict the fourth word; and from the third and fourth word, 00:40:31.900 --> 00:40:36.090 predict the fifth word, ultimately generating random sentences that 00:40:36.090 --> 00:40:39.420 sound like Shakespeare, that are using similar patterns of words 00:40:39.420 --> 00:40:43.140 that Shakespeare used, but that never actually showed up in Shakespeare 00:40:43.140 --> 00:40:44.770 itself. 
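The repeat-and-slide loop just described can be hand-rolled in a few lines. This is a minimal sketch of the idea only; the markovify library used in the demo adds sentence-boundary handling and other refinements on top of it, and the corpus and starting words here are illustrative.

```python
# Hand-rolled sketch of the generation loop: sample a third word from
# trigram data, slide the two-word window forward, and repeat.
# Output varies from run to run, just like the lecture's demo.
import random
from collections import defaultdict

def build_model(text):
    words = text.split()
    model = defaultdict(list)
    for a, b, c in zip(words, words[1:], words[2:]):
        model[(a, b)].append(c)  # candidate third words for each word pair
    return model

def generate(model, w1, w2, length=8):
    output = [w1, w2]
    for _ in range(length - 2):
        candidates = model.get((w1, w2))
        if not candidates:
            break  # no trigram starts with this pair; stop early
        w1, w2 = w2, random.choice(candidates)
        output.append(w2)
    return " ".join(output)

model = build_model("it was a dark night and it was a cold night and it was a dark day")
print(generate(model, "it", "was"))
```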
00:40:44.770 --> 00:40:47.640 And so to do so, I'll show you generator.py, 00:40:47.640 --> 00:40:50.910 which, again, is just going to read data from a particular file. 00:40:50.910 --> 00:40:54.210 And I'm using a Python library called markovify, which is just 00:40:54.210 --> 00:40:56.050 going to do this process for me. 00:40:56.050 --> 00:40:59.370 So there are libraries out there that can just train on a bunch of text 00:40:59.370 --> 00:41:02.978 and come up with a Markov model based on that text. 00:41:02.978 --> 00:41:04.770 And I'm going to go ahead and just generate 00:41:04.770 --> 00:41:07.920 five randomly generated sentences. 00:41:07.920 --> 00:41:11.850 So we'll go ahead and go into markov. 00:41:11.850 --> 00:41:14.750 I'll run the generator on shakespeare.txt. 00:41:14.750 --> 00:41:18.290 What we'll see is it's going to load that data, and then here's what we get. 00:41:18.290 --> 00:41:21.320 We get five different sentences, and these 00:41:21.320 --> 00:41:24.890 are sentences that never showed up in any Shakespeare play, 00:41:24.890 --> 00:41:27.680 but that are designed to sound like Shakespeare, 00:41:27.680 --> 00:41:30.320 that are designed to just take two words and predict, 00:41:30.320 --> 00:41:34.100 given those two words, what would Shakespeare have been likely to choose 00:41:34.100 --> 00:41:35.517 as the third word that follows it. 00:41:35.517 --> 00:41:38.100 And you know, these sentences probably don't have any meaning. 00:41:38.100 --> 00:41:41.600 It's not like the AI is trying to express any sort of underlying meaning 00:41:41.600 --> 00:41:42.110 here. 00:41:42.110 --> 00:41:44.870 It's just trying to understand, based on the sequence 00:41:44.870 --> 00:41:50.190 of words, what is likely to come after it as a next word, for example. 00:41:50.190 --> 00:41:53.593 And these are the types of sentences that it's able to come up with, 00:41:53.593 --> 00:41:54.260 just generating. 
00:41:54.260 --> 00:41:58.100 And if you ran this multiple times, you would end up getting different results. 00:41:58.100 --> 00:42:01.580 I could run this again and get an entirely different set 00:42:01.580 --> 00:42:04.100 of five different sentences that also are 00:42:04.100 --> 00:42:08.810 supposed to sound kind of like the way that Shakespeare's sentences sounded 00:42:08.810 --> 00:42:10.340 as well. 00:42:10.340 --> 00:42:12.430 And so that then was a look at how it is we 00:42:12.430 --> 00:42:16.580 can use Markov models to be able to naively attempt generating language. 00:42:16.580 --> 00:42:18.580 The language doesn't mean a whole lot right now. 00:42:18.580 --> 00:42:21.430 You wouldn't want to use the system in its current form 00:42:21.430 --> 00:42:23.200 to do something like machine translation, 00:42:23.200 --> 00:42:26.020 because it wouldn't be able to encapsulate any meaning, 00:42:26.020 --> 00:42:30.240 but we're starting to see now that our AI is getting a little bit better 00:42:30.240 --> 00:42:31.990 at trying to speak our language, at trying 00:42:31.990 --> 00:42:36.500 to be able to process natural language in some sort of meaningful way. 00:42:36.500 --> 00:42:38.830 So we'll now take a look at a couple of other tasks 00:42:38.830 --> 00:42:41.140 that we might want our AI to be able to perform. 00:42:41.140 --> 00:42:44.920 And one such task is text categorization, which really is just 00:42:44.920 --> 00:42:46.138 a classification problem. 00:42:46.138 --> 00:42:48.430 And we've talked about classification problems already, 00:42:48.430 --> 00:42:51.670 these problems where we would like to take some object 00:42:51.670 --> 00:42:54.540 and categorize it into a number of different classes. 
00:42:54.540 --> 00:42:58.750 And so the way this comes up in text is anytime you have some sample of text 00:42:58.750 --> 00:43:02.080 and you want to put it inside of a category, where I want to say something 00:43:02.080 --> 00:43:06.760 like, given an email, does it belong in the inbox or does it belong in spam? 00:43:06.760 --> 00:43:08.890 Which of these two categories does it belong in? 00:43:08.890 --> 00:43:12.250 And you do that by looking at the text and being 00:43:12.250 --> 00:43:16.660 able to do some sort of analysis on that text to be able to draw conclusions, 00:43:16.660 --> 00:43:20.200 to be able to say that, given the words that show up in the email, 00:43:20.200 --> 00:43:22.510 I think this probably belongs in the inbox, 00:43:22.510 --> 00:43:25.825 or I think it probably belongs in spam instead. 00:43:25.825 --> 00:43:27.700 And you might imagine doing this for a number 00:43:27.700 --> 00:43:30.910 of different types of classification problems of this sort. 00:43:30.910 --> 00:43:34.360 So you might imagine that another common example of this type of idea 00:43:34.360 --> 00:43:37.690 is something like sentiment analysis, where I want to analyze, 00:43:37.690 --> 00:43:41.880 given a sample of text, does it have a positive sentiment 00:43:41.880 --> 00:43:43.780 or does it have a negative sentiment? 
00:43:43.780 --> 00:43:47.082 And this might come up in the case of product reviews on a website, 00:43:47.082 --> 00:43:50.290 for example, or feedback on a website, where you have a whole bunch of data-- 00:43:50.290 --> 00:43:53.230 samples of text that are provided by users of a website-- 00:43:53.230 --> 00:43:57.010 and you want to be able to quickly analyze, are these reviews positive, 00:43:57.010 --> 00:43:59.710 are the reviews negative, what is it that people 00:43:59.710 --> 00:44:03.460 are saying, just to get a sense for what it is that people are saying, 00:44:03.460 --> 00:44:08.840 to be able to categorize text into one of these two different categories. 00:44:08.840 --> 00:44:10.630 So how might we approach this problem? 00:44:10.630 --> 00:44:13.010 Well, let's take a look at some sample product reviews. 00:44:13.010 --> 00:44:16.000 Here are some sample product reviews that we might come up with. 00:44:16.000 --> 00:44:16.930 My grandson loved it. 00:44:16.930 --> 00:44:17.890 So much fun. 00:44:17.890 --> 00:44:20.290 Product broke after a few days. 00:44:20.290 --> 00:44:22.368 One of the best games I've played in a long time. 00:44:22.368 --> 00:44:23.410 Kind of cheap and flimsy. 00:44:23.410 --> 00:44:24.400 Not worth it. 00:44:24.400 --> 00:44:28.360 Different product reviews that you might imagine seeing on Amazon, or eBay, 00:44:28.360 --> 00:44:31.690 or some other website where people are selling products, for instance. 00:44:31.690 --> 00:44:34.480 And we humans can pretty easily categorize these 00:44:34.480 --> 00:44:37.060 into positive sentiment or negative sentiment. 00:44:37.060 --> 00:44:39.790 We'd probably say that the first and the third one, those 00:44:39.790 --> 00:44:41.620 are positive sentiment messages. 00:44:41.620 --> 00:44:44.380 The second one and the fourth one, those are probably 00:44:44.380 --> 00:44:46.060 negative sentiment messages. 00:44:46.060 --> 00:44:48.680 But how could a computer do the same thing? 
00:44:48.680 --> 00:44:53.470 How could it try and take these reviews and assess, are they positive 00:44:53.470 --> 00:44:55.420 or are they negative? 00:44:55.420 --> 00:44:57.940 Well, ultimately, it depends upon the words 00:44:57.940 --> 00:45:02.530 that happen to be in this particular-- these particular reviews-- inside 00:45:02.530 --> 00:45:03.850 of these particular sentences. 00:45:03.850 --> 00:45:06.040 For now we're going to ignore the structure 00:45:06.040 --> 00:45:08.120 and how the words are related to each other, 00:45:08.120 --> 00:45:11.230 and we're just going to focus on what the words actually are. 00:45:11.230 --> 00:45:14.710 So there are probably some key words here, words like loved, 00:45:14.710 --> 00:45:16.330 and fun, and best. 00:45:16.330 --> 00:45:20.770 Those probably show up in more positive reviews, whereas words 00:45:20.770 --> 00:45:23.137 like broke, and cheap, and flimsy-- 00:45:23.137 --> 00:45:24.970 well, those are words that probably are more 00:45:24.970 --> 00:45:29.930 likely to come up inside of negative reviews, instead of positive reviews. 00:45:29.930 --> 00:45:33.550 So one way to approach this sort of text analysis idea 00:45:33.550 --> 00:45:37.900 is to say, let's, for now, ignore the structures of these sentences-- to say, 00:45:37.900 --> 00:45:40.870 we're not going to care about how it is the words relate to each other. 00:45:40.870 --> 00:45:43.540 We're not going to try and parse these sentences to construct 00:45:43.540 --> 00:45:45.850 the grammatical structure like we saw a moment ago. 
00:45:45.850 --> 00:45:49.060 But we can probably just rely on the words that were actually 00:45:49.060 --> 00:45:52.000 used-- rely on the fact that the positive reviews are 00:45:52.000 --> 00:45:54.820 more likely to have words like best, and loved, and fun, 00:45:54.820 --> 00:45:58.360 and that the negative reviews are more likely to have the negative words 00:45:58.360 --> 00:46:00.017 that we've highlighted there as well. 00:46:00.017 --> 00:46:03.100 And this sort of model-- this approach to trying to think about language-- 00:46:03.100 --> 00:46:05.610 is generally known as the bag of words model, 00:46:05.610 --> 00:46:09.023 where we're going to model a sample of text not by caring about its structure, 00:46:09.023 --> 00:46:12.970 but just by caring about the unordered collection of words that 00:46:12.970 --> 00:46:16.060 show up inside of a sample-- that all we care about 00:46:16.060 --> 00:46:18.040 is what words are in the text. 00:46:18.040 --> 00:46:20.552 And we don't care about what the order of those words is. 00:46:20.552 --> 00:46:22.510 We don't care about the structure of the words. 00:46:22.510 --> 00:46:25.210 We don't care what noun goes with what adjective 00:46:25.210 --> 00:46:26.870 or how things agree with each other. 00:46:26.870 --> 00:46:28.830 We just care about the words. 00:46:28.830 --> 00:46:31.120 And it turns out this approach tends to work 00:46:31.120 --> 00:46:34.810 pretty well for doing classifications like positive sentiment 00:46:34.810 --> 00:46:36.142 or negative sentiment. 00:46:36.142 --> 00:46:38.350 And you could imagine doing this in a number of ways. 00:46:38.350 --> 00:46:41.740 We've talked about different approaches to trying to solve classification style 00:46:41.740 --> 00:46:43.870 problems, but when it comes to natural language, 00:46:43.870 --> 00:46:48.110 one of the most popular approaches is the naive Bayes approach. 
00:46:48.110 --> 00:46:52.530 And this is one approach to trying to analyze the probability that something 00:46:52.530 --> 00:46:54.940 is positive sentiment or negative sentiment, 00:46:54.940 --> 00:46:58.515 or just trying to categorize some text into possible categories. 00:46:58.515 --> 00:47:01.390 And it doesn't just work for text-- it works for other types of ideas 00:47:01.390 --> 00:47:03.550 as well-- but it is quite popular in the world 00:47:03.550 --> 00:47:05.980 of analyzing text and natural language. 00:47:05.980 --> 00:47:09.450 And the naive Bayes approach is based on Bayes' rule, which 00:47:09.450 --> 00:47:11.950 you might recall back from when we talked about probability, 00:47:11.950 --> 00:47:14.020 that Bayes' rule looks like this-- 00:47:14.020 --> 00:47:17.690 that the probability of some event b, given a, 00:47:17.690 --> 00:47:20.320 can be expressed using this expression over here. 00:47:20.320 --> 00:47:25.150 Probability of b given a is the probability of a given b multiplied 00:47:25.150 --> 00:47:28.590 by the probability of b divided by the probability of a. 00:47:28.590 --> 00:47:32.290 And we saw that this came about as a result of just the definition 00:47:32.290 --> 00:47:35.740 of conditional probability and looking at what it means for two events 00:47:35.740 --> 00:47:37.010 to happen together. 00:47:37.010 --> 00:47:40.038 This was our formulation then of Bayes' rule, which 00:47:40.038 --> 00:47:41.330 turned out to be quite helpful. 00:47:41.330 --> 00:47:43.990 We were able to predict one event in terms of another 00:47:43.990 --> 00:47:49.218 by flipping the order of those events inside of this probability calculation. 
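As a quick numeric sanity check of Bayes' rule as stated, here is a sketch with made-up probabilities (the numbers are not from the lecture); the denominator P(a) is expanded with the law of total probability so the example is self-contained:

```python
# Sanity check of Bayes' rule: P(b|a) = P(a|b) * P(b) / P(a).
# All probabilities below are made up purely for illustration.
p_b = 0.3
p_a_given_b = 0.8
p_a_given_not_b = 0.2

# Law of total probability: P(a) = P(a|b)P(b) + P(a|not b)P(not b)
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)  # 0.24 + 0.14 = 0.38

p_b_given_a = p_a_given_b * p_b / p_a
print(round(p_b_given_a, 4))  # → 0.6316
```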
00:47:49.218 --> 00:47:51.760 And it turns out this approach is going to be quite helpful-- 00:47:51.760 --> 00:47:53.110 and we'll see why in a moment-- 00:47:53.110 --> 00:47:55.330 for being able to do this sort of sentiment analysis, 00:47:55.330 --> 00:47:58.750 because I want to say, you know, what is the probability 00:47:58.750 --> 00:48:02.350 that a message is positive, or what is the probability 00:48:02.350 --> 00:48:03.727 that the message is negative? 00:48:03.727 --> 00:48:06.310 And I'll go ahead and simplify this just using the emojis just 00:48:06.310 --> 00:48:10.450 for simplicity-- probability of positive, probability of negative. 00:48:10.450 --> 00:48:12.340 And that is what I would like to calculate, 00:48:12.340 --> 00:48:15.310 but I'd like to calculate that given some information-- 00:48:15.310 --> 00:48:18.940 given information like here is a sample of text-- 00:48:18.940 --> 00:48:20.440 my grandson loved it. 00:48:20.440 --> 00:48:24.280 And I would like to know not just what is the probability that any message is 00:48:24.280 --> 00:48:27.880 positive, but what is the probability that the message is positive, 00:48:27.880 --> 00:48:32.890 given my grandson loved it as the text of the sample? 00:48:32.890 --> 00:48:36.340 So given this information that inside the sample are the words my grandson 00:48:36.340 --> 00:48:41.860 loved it, what is the probability then that this is a positive message? 00:48:41.860 --> 00:48:44.650 Well, according to the bag of words model, what we're going to do 00:48:44.650 --> 00:48:46.930 is really ignore the ordering of the words-- 00:48:46.930 --> 00:48:50.420 not treat this as a single sentence that has some structure to it, 00:48:50.420 --> 00:48:52.750 but just treat it as a whole bunch of different words. 
00:48:52.750 --> 00:48:55.180 We're going to say something like, what is the probability 00:48:55.180 --> 00:48:58.420 that this is a positive message, given that the word my 00:48:58.420 --> 00:49:01.810 was in the message, given that the word grandson was in the message, 00:49:01.810 --> 00:49:05.520 given that the word loved was in the message, and given that the word it 00:49:05.520 --> 00:49:06.380 was in the message? 00:49:06.380 --> 00:49:07.720 The bag of words model here-- 00:49:07.720 --> 00:49:11.380 we're treating the entire sample as just a whole bunch 00:49:11.380 --> 00:49:12.740 of different words. 00:49:12.740 --> 00:49:15.910 And so this then is what I'd like to calculate, this probability-- 00:49:15.910 --> 00:49:18.610 given all those words, what is the probability 00:49:18.610 --> 00:49:20.920 that this is a positive message? 00:49:20.920 --> 00:49:23.530 And this is where we can now apply Bayes' rule. 00:49:23.530 --> 00:49:28.315 This is really the probability of some b, given some a. 00:49:28.315 --> 00:49:30.400 And that now is what I'd like to calculate. 00:49:30.400 --> 00:49:34.723 So according to Bayes' rule, this whole expression is equal to-- 00:49:34.723 --> 00:49:35.890 well, it's the probability-- 00:49:35.890 --> 00:49:37.420 I switched the order of them-- 00:49:37.420 --> 00:49:40.270 it's the probability of all of these words, 00:49:40.270 --> 00:49:42.910 given that it's a positive message, multiplied 00:49:42.910 --> 00:49:46.930 by the probability that it is a positive message divided 00:49:46.930 --> 00:49:49.575 by the probability of all of those words. 00:49:49.575 --> 00:49:51.700 So this then is just an application of Bayes' rule. 00:49:51.700 --> 00:49:56.680 We've already seen where I want to express the probability of positive, 00:49:56.680 --> 00:50:02.440 given the words, as related to somehow the probability of the words, 00:50:02.440 --> 00:50:04.718 given that it's a positive message. 
00:50:04.718 --> 00:50:06.760 And it turns out that-- as you might recall, back 00:50:06.760 --> 00:50:09.965 when we talked about probability, that this denominator is 00:50:09.965 --> 00:50:10.840 going to be the same. 00:50:10.840 --> 00:50:13.840 Regardless of whether we're looking at positive or negative messages, 00:50:13.840 --> 00:50:15.850 the probability of these words doesn't change, 00:50:15.850 --> 00:50:18.805 because we don't have a positive or negative down below. 00:50:18.805 --> 00:50:20.680 So we can just say that, rather than just say 00:50:20.680 --> 00:50:23.980 that this expression up here is equal to this expression down below, 00:50:23.980 --> 00:50:27.130 it's really just proportional to just the numerator. 00:50:27.130 --> 00:50:29.530 We can ignore the denominator for now. 00:50:29.530 --> 00:50:32.770 Using the denominator would get us an exact probability. 00:50:32.770 --> 00:50:34.780 But it turns out that what we'll really just do 00:50:34.780 --> 00:50:38.780 is figure out what the probability is proportional to, and at the end, 00:50:38.780 --> 00:50:41.500 we'll have to normalize the probability distribution-- make 00:50:41.500 --> 00:50:46.270 sure the probability distribution ultimately sums up to the number 1. 00:50:46.270 --> 00:50:49.730 So now I've been able to formulate this probability-- 00:50:49.730 --> 00:50:51.520 which is what I want to care about-- 00:50:51.520 --> 00:50:56.530 as proportional to multiplying these two things together-- probability of words, 00:50:56.530 --> 00:51:01.580 given positive message, multiplied by the probability of positive message. 
00:51:01.580 --> 00:51:04.060 But again, if you think back to our probability rules, 00:51:04.060 --> 00:51:09.070 we can calculate this really as just a joint probability of all of these 00:51:09.070 --> 00:51:14.140 things happening-- that the probability of positive message multiplied 00:51:14.140 --> 00:51:17.470 by the probability of these words, given the positive message-- 00:51:17.470 --> 00:51:20.890 well, that's just the joint probability of all of these things. 00:51:20.890 --> 00:51:23.530 This is the same thing as the probability 00:51:23.530 --> 00:51:27.670 that it's a positive message, and my is in the sample, 00:51:27.670 --> 00:51:30.820 and grandson is in the sample, and loved is in the sample, 00:51:30.820 --> 00:51:33.160 and it is in the sample. 00:51:33.160 --> 00:51:36.640 So using that rule for the definition of joint probability, 00:51:36.640 --> 00:51:40.630 I've been able to say that this entire expression is now 00:51:40.630 --> 00:51:43.570 proportional to this sequence-- 00:51:43.570 --> 00:51:47.530 this joint probability of these words and this positive that's 00:51:47.530 --> 00:51:49.670 in there as well. 00:51:49.670 --> 00:51:51.790 And so now the interesting question is just how 00:51:51.790 --> 00:51:54.050 to calculate that joint probability. 00:51:54.050 --> 00:51:55.870 How do I figure out the probability that, 00:51:55.870 --> 00:51:59.980 given some arbitrary message, that it is positive, and the word my is in there, 00:51:59.980 --> 00:52:03.040 and the word grandson is in there, and the word loved is in there, 00:52:03.040 --> 00:52:04.740 and the word it is in there? 00:52:04.740 --> 00:52:07.990 Well, you'll recall that we can calculate a joint probability 00:52:07.990 --> 00:52:12.480 by multiplying together all of these conditional probabilities. 
00:52:12.480 --> 00:52:16.350 If I want to know the probability of a, and b, and c, 00:52:16.350 --> 00:52:19.530 I can calculate that as the probability of a times 00:52:19.530 --> 00:52:24.300 the probability of b, given a, times the probability of c, given a and b. 00:52:24.300 --> 00:52:27.570 I can just multiply these conditional probabilities together 00:52:27.570 --> 00:52:31.290 in order to get the overall joint probability that I care about. 00:52:31.290 --> 00:52:32.790 And we could do the same thing here. 00:52:32.790 --> 00:52:35.340 I could say, let's multiply the probability 00:52:35.340 --> 00:52:39.180 of positive by the probability of the word my showing up in the message, 00:52:39.180 --> 00:52:42.810 given that it's positive, multiplied by the probability of grandson 00:52:42.810 --> 00:52:45.550 showing up in the message, given that the word my is in there 00:52:45.550 --> 00:52:48.930 and that it's positive, multiplied by the probability of loved, 00:52:48.930 --> 00:52:51.930 given these three things, multiplied by the probability of it, 00:52:51.930 --> 00:52:53.500 given these four things. 00:52:53.500 --> 00:52:56.882 And that's going to end up being a fairly complex calculation to make, 00:52:56.882 --> 00:52:58.590 one that we probably aren't going to have 00:52:58.590 --> 00:53:00.210 a good way of knowing the answer to. 00:53:00.210 --> 00:53:04.140 What is the probability that grandson is in the message, given 00:53:04.140 --> 00:53:08.010 that it is positive and the word my is in the message? 00:53:08.010 --> 00:53:12.040 That's not something we're really going to have a readily easy answer to, 00:53:12.040 --> 00:53:15.270 and so this is where the naive part of naive Bayes comes about. 00:53:15.270 --> 00:53:16.950 We're going to simplify this notion. 
00:53:16.950 --> 00:53:20.340 Rather than compute exactly what that probability distribution is, 00:53:20.340 --> 00:53:23.880 we're going to assume that these words are 00:53:23.880 --> 00:53:26.710 going to be effectively independent of each other, 00:53:26.710 --> 00:53:28.980 if we know that it's already a positive message. 00:53:28.980 --> 00:53:32.670 If it's a positive message, it doesn't change the probability 00:53:32.670 --> 00:53:34.620 that the word grandson is in the message, 00:53:34.620 --> 00:53:37.620 if I know that the word loved is in the message, for example. 00:53:37.620 --> 00:53:39.750 And that might not necessarily be true in practice. 00:53:39.750 --> 00:53:41.610 In the real world, it might not be the case 00:53:41.610 --> 00:53:43.650 that these words are actually independent, 00:53:43.650 --> 00:53:45.960 but we're going to assume it to simplify our model. 00:53:45.960 --> 00:53:48.030 And it turns out that simplification still 00:53:48.030 --> 00:53:51.590 lets us get pretty good results out of it as well. 00:53:51.590 --> 00:53:55.320 And what we're going to assume is that the probability that all of these words 00:53:55.320 --> 00:53:58.690 show up depends only on whether it's positive or negative. 00:53:58.690 --> 00:54:01.170 I can still say that loved is more likely to come up 00:54:01.170 --> 00:54:04.510 in a positive message than a negative message, which is probably true, 00:54:04.510 --> 00:54:08.010 but we're also going to say that it's not going to change whether or not 00:54:08.010 --> 00:54:12.020 loved is more likely or less likely to come up if I know that the word my is 00:54:12.020 --> 00:54:13.643 in the message, for example. 00:54:13.643 --> 00:54:16.060 And so those are the assumptions that we're going to make. 
00:54:16.060 --> 00:54:20.310 So while the top expression is proportional to this bottom expression, 00:54:20.310 --> 00:54:24.750 we're going to say it's naively proportional to this expression, 00:54:24.750 --> 00:54:27.480 probability of being a positive message. 00:54:27.480 --> 00:54:30.300 And then, for each of the words that show up in the sample, 00:54:30.300 --> 00:54:33.270 I'm going to multiply what's the probability that my 00:54:33.270 --> 00:54:35.370 is in the message, given that it's positive, 00:54:35.370 --> 00:54:37.980 times the probability of grandson being in the message, given 00:54:37.980 --> 00:54:40.050 that it's positive-- and then so on and so forth 00:54:40.050 --> 00:54:44.040 for the other words that happen to be inside of the sample. 00:54:44.040 --> 00:54:47.580 And it turns out that these are numbers that we can calculate. 00:54:47.580 --> 00:54:50.640 The reason we've done all of this math is to get to this point, 00:54:50.640 --> 00:54:54.870 to be able to calculate this probability distribution that we care about, 00:54:54.870 --> 00:54:58.410 given these terms that we can actually calculate. 00:54:58.410 --> 00:55:02.250 And we can calculate them, given some data available to us. 00:55:02.250 --> 00:55:04.530 And this is what a lot of natural language processing 00:55:04.530 --> 00:55:05.590 is about these days. 00:55:05.590 --> 00:55:07.330 It's about analyzing data. 00:55:07.330 --> 00:55:10.440 If I give you a whole bunch of data with a whole bunch of reviews, 00:55:10.440 --> 00:55:13.380 and I've labeled them as positive or negative, 00:55:13.380 --> 00:55:17.250 then you can begin to calculate these particular terms. 00:55:17.250 --> 00:55:20.490 I can calculate the probability that a message is positive just 00:55:20.490 --> 00:55:22.710 by looking at my data and saying, how many 00:55:22.710 --> 00:55:26.250 positive samples were there, and divide that by the number of total samples. 
00:55:26.250 --> 00:55:29.477 That is my probability that a message is positive. 00:55:29.477 --> 00:55:32.310 What is the probability that the word loved is in the message, given 00:55:32.310 --> 00:55:33.330 that it's positive? 00:55:33.330 --> 00:55:35.490 Well, I can calculate that based on my data too. 00:55:35.490 --> 00:55:38.970 Let me just look at how many positive samples have the word loved in it 00:55:38.970 --> 00:55:41.730 and divide that by my total number of positive samples. 00:55:41.730 --> 00:55:44.430 And that will give me an approximation for, 00:55:44.430 --> 00:55:47.950 what is the probability that loved is going to show up inside of the review, 00:55:47.950 --> 00:55:51.570 given that we know that the review is positive. 00:55:51.570 --> 00:55:55.160 And so this then allows us to be able to calculate these probabilities. 00:55:55.160 --> 00:55:56.910 So let's now actually do this calculation. 00:55:56.910 --> 00:56:00.390 Let's calculate for the sentence, my grandson loved it. 00:56:00.390 --> 00:56:01.890 Is it a positive or negative review? 00:56:01.890 --> 00:56:04.030 How could we figure out those probabilities? 00:56:04.030 --> 00:56:07.110 Well, again, this up here is the expression we're trying to calculate. 00:56:07.110 --> 00:56:10.350 And I'll give you the data that is available to us. 00:56:10.350 --> 00:56:13.080 And the way to interpret this data in this case 00:56:13.080 --> 00:56:19.127 is that, of all of the messages, 49% of them were positive and 51% of them 00:56:19.127 --> 00:56:19.710 were negative. 00:56:19.710 --> 00:56:22.350 Maybe online reviews tend to be a little bit more negative than they 00:56:22.350 --> 00:56:24.683 are positive-- or at least based on this particular data 00:56:24.683 --> 00:56:26.620 sample, that's what I have. 
00:56:26.620 --> 00:56:31.800 And then I have distributions for each of the various different words-- 00:56:31.800 --> 00:56:34.290 that, given that it's a positive message, 00:56:34.290 --> 00:56:38.040 how many positive messages had the word my in them? 00:56:38.040 --> 00:56:39.335 It's about 30%. 00:56:39.335 --> 00:56:42.210 And for negative messages, how many of those had the word my in them? 00:56:42.210 --> 00:56:47.910 About 20%-- so it seems like the word my comes up more often in positive 00:56:47.910 --> 00:56:52.140 messages-- at least slightly more often based on this analysis here. 00:56:52.140 --> 00:56:54.270 Grandson, for example-- maybe that showed up 00:56:54.270 --> 00:56:58.680 in 1% of all positive messages and 2% of all negative messages 00:56:58.680 --> 00:57:00.330 had the word grandson in it. 00:57:00.330 --> 00:57:05.010 The word loved showed up in 32% of all positive messages, 8% 00:57:05.010 --> 00:57:07.090 of all negative messages, for example. 00:57:07.090 --> 00:57:10.230 And then the word it showed up in 30% of positive messages, 00:57:10.230 --> 00:57:15.130 40% of negative messages-- again, just arbitrary data here just for example, 00:57:15.130 --> 00:57:19.560 but now we have data with which we can begin to calculate this expression. 00:57:19.560 --> 00:57:22.950 So how do I calculate multiplying all these values together? 00:57:22.950 --> 00:57:25.650 Well, it's just going to be multiplying probability 00:57:25.650 --> 00:57:29.400 that it's positive times the probability of my, given positive, 00:57:29.400 --> 00:57:32.190 times the probability of grandson, given positive-- 00:57:32.190 --> 00:57:34.290 so on and so forth for each of the other words. 00:57:34.290 --> 00:57:37.780 And if you do that multiplication and multiply all of those values together, 00:57:37.780 --> 00:57:42.000 you get this, 0.00014112. 
00:57:42.000 --> 00:57:44.760 By itself, this is not a meaningful number, 00:57:44.760 --> 00:57:48.810 but it's going to be meaningful if you compare this expression-- 00:57:48.810 --> 00:57:53.250 the probability that it's positive times the probability of all of the words, 00:57:53.250 --> 00:57:55.680 given that I know that the message is positive, 00:57:55.680 --> 00:57:59.350 and compare it to the same thing, but for negative sentiment messages 00:57:59.350 --> 00:57:59.850 instead. 00:57:59.850 --> 00:58:03.090 I want to know the probability that it's a negative message 00:58:03.090 --> 00:58:05.430 times the probability of all of these words, 00:58:05.430 --> 00:58:07.900 given that it's a negative message. 00:58:07.900 --> 00:58:09.360 And so how can I do that? 00:58:09.360 --> 00:58:13.280 Well, to do that, you just multiply the probability of negative times 00:58:13.280 --> 00:58:15.500 all of these conditional probabilities. 00:58:15.500 --> 00:58:19.520 And if I take those five values, multiply all of them together, 00:58:19.520 --> 00:58:26.730 then what I get is this value for negative: 0.00006528-- 00:58:26.730 --> 00:58:30.080 again, in isolation, not a particularly meaningful number. 00:58:30.080 --> 00:58:35.300 What is meaningful is treating these two values as a probability distribution 00:58:35.300 --> 00:58:39.260 and normalizing them, making it so that both of these values sum up to 1 00:58:39.260 --> 00:58:41.450 the way a probability distribution should. 00:58:41.450 --> 00:58:45.740 And we do so by adding these two up and then dividing each of these values 00:58:45.740 --> 00:58:48.120 by their total in order to be able to normalize them. 00:58:48.120 --> 00:58:51.170 And when we do that, when we normalize this probability distribution, 00:58:51.170 --> 00:58:58.400 you end up getting something like this: positive 0.6837, negative 0.3163.
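That arithmetic can be checked in a few lines of Python. This is a minimal sketch of the calculation just described, with the example probabilities from the lecture's made-up data hard-coded; it is not the NLTK implementation used later.

```python
# Worked example: P(positive | "my grandson loved it") via naive Bayes.
# The probabilities below are the example values from the lecture's data.
p_positive = 0.49
p_negative = 0.51

# P(word | sentiment) for each word in "my grandson loved it"
word_given_positive = {"my": 0.30, "grandson": 0.01, "loved": 0.32, "it": 0.30}
word_given_negative = {"my": 0.20, "grandson": 0.02, "loved": 0.08, "it": 0.40}

# Multiply the prior by each conditional probability (the "naive" step).
score_positive = p_positive
for p in word_given_positive.values():
    score_positive *= p

score_negative = p_negative
for p in word_given_negative.values():
    score_negative *= p

# Normalize so the two scores sum to 1, turning them into a distribution.
total = score_positive + score_negative
prob_positive = score_positive / total
prob_negative = score_negative / total

print(round(score_positive, 8))  # 0.00014112
print(round(prob_positive, 4))   # 0.6837
```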
00:58:58.400 --> 00:59:02.990 It seems like we've been able to conclude that we are about 68% 00:59:02.990 --> 00:59:06.500 confident-- we think there's a probability of 0.68 00:59:06.500 --> 00:59:09.470 that this message is a positive message-- my grandson loved it. 00:59:09.470 --> 00:59:11.540 And why are we 68% confident? 00:59:11.540 --> 00:59:15.350 Well, it seems like we're more confident than not because the word 00:59:15.350 --> 00:59:18.350 loved showed up in 32% of positive messages, 00:59:18.350 --> 00:59:20.420 but only 8% of negative messages. 00:59:20.420 --> 00:59:22.410 So that was a pretty strong indicator. 00:59:22.410 --> 00:59:25.070 And for the others, while it's true that the word 00:59:25.070 --> 00:59:27.260 it showed up more often in negative messages, 00:59:27.260 --> 00:59:30.170 it wasn't enough to offset the fact that loved shows up 00:59:30.170 --> 00:59:34.560 far more often in positive messages than negative messages. 00:59:34.560 --> 00:59:37.970 And so this type of analysis is how we can apply naive Bayes. 00:59:37.970 --> 00:59:39.650 We've just done this calculation. 00:59:39.650 --> 00:59:42.933 And we end up getting not just a categorization of positive or negative, 00:59:42.933 --> 00:59:44.600 but also some sort of confidence level. 00:59:44.600 --> 00:59:47.660 What do I think the probability is that it's positive? 00:59:47.660 --> 00:59:52.560 And I can say I think it's positive with this particular probability. 00:59:52.560 --> 00:59:55.820 And so naive Bayes can be quite powerful at trying to achieve this. 00:59:55.820 --> 00:59:58.250 Using just this bag of words model, where all I'm doing 00:59:58.250 --> 01:00:00.950 is looking at what words show up in the sample, 01:00:00.950 --> 01:00:03.870 I'm able to draw these sorts of conclusions.
01:00:03.870 --> 01:00:07.280 Now, one potential drawback-- something that you'll notice pretty quickly 01:00:07.280 --> 01:00:10.190 if you start applying this rule exactly as is-- 01:00:10.190 --> 01:00:15.500 is what happens if 0's are inside this data somewhere. 01:00:15.500 --> 01:00:20.410 Let's imagine, for example, this same sentence-- my grandson loved it-- 01:00:20.410 --> 01:00:24.980 but let's instead imagine that this value here, instead of being 0.01, 01:00:24.980 --> 01:00:28.970 was 0, meaning inside of our data set, it has never 01:00:28.970 --> 01:00:33.620 before happened that in a positive message the word grandson showed up. 01:00:33.620 --> 01:00:35.450 And that's certainly possible. 01:00:35.450 --> 01:00:37.817 If I have a pretty small data set, it's probably likely 01:00:37.817 --> 01:00:40.400 that not all the messages are going to have the word grandson. 01:00:40.400 --> 01:00:43.400 Maybe it is the case that no positive messages have ever 01:00:43.400 --> 01:00:46.370 had the word grandson in them, at least in my data set. 01:00:46.370 --> 01:00:49.640 But if it is the case that 2% of the negative messages 01:00:49.640 --> 01:00:52.340 have still had the word grandson in them, then we 01:00:52.340 --> 01:00:54.330 run into an interesting challenge. 01:00:54.330 --> 01:00:57.730 And the challenge is this-- when I multiply all of the positive numbers 01:00:57.730 --> 01:01:00.980 together and multiply all the negative numbers together to calculate these two 01:01:00.980 --> 01:01:06.800 probabilities, what I end up getting is a positive value of 0.000. 01:01:06.800 --> 01:01:10.010 I get pure 0's, because when I multiply all of these numbers 01:01:10.010 --> 01:01:12.470 together-- when I multiply something by 0, 01:01:12.470 --> 01:01:15.770 it doesn't matter what the other numbers are-- the result is going to be 0. 01:01:15.770 --> 01:01:19.710 And the same thing can be said of negative numbers as well.
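Numerically, the collapse looks like this. The sketch below reuses the same example probabilities as before, with P(grandson | positive) zeroed out:

```python
# Same example values as before, but P(grandson | positive) is now 0.
# Multiplying by 0 wipes out every other probability in the product.
p_positive = 0.49 * 0.30 * 0 * 0.32 * 0.30
p_negative = 0.51 * 0.20 * 0.02 * 0.08 * 0.40

print(p_positive)  # 0.0 -- no other word can rescue the positive score
print(p_negative)  # a small but nonzero value, so negative "wins" outright
```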
01:01:19.710 --> 01:01:24.320 So this then would seem to be a problem that, because grandson has never 01:01:24.320 --> 01:01:27.630 showed up in any of the positive messages inside of our sample, 01:01:27.630 --> 01:01:31.340 we're able to say-- we seem to be concluding that there is a 0% 01:01:31.340 --> 01:01:33.110 chance that the message is positive. 01:01:33.110 --> 01:01:37.105 And therefore, it must be negative, because the only cases where 01:01:37.105 --> 01:01:39.980 we've seen the word grandson come up is inside of a negative message. 01:01:39.980 --> 01:01:43.340 And in doing so, we've totally ignored all of the other probabilities 01:01:43.340 --> 01:01:46.940 that a positive message is much more likely to have the word loved in it, 01:01:46.940 --> 01:01:49.190 because we've multiplied by 0, which just 01:01:49.190 --> 01:01:53.670 means none of the other probabilities can possibly matter at all. 01:01:53.670 --> 01:01:55.920 So this then is a challenge that we need to deal with. 01:01:55.920 --> 01:01:57.380 It means that we're likely not going to be 01:01:57.380 --> 01:02:00.220 able to get the correct results if we just purely use this approach. 01:02:00.220 --> 01:02:02.720 And it's for that reason there are a number of possible ways 01:02:02.720 --> 01:02:06.230 we can try and make sure that we never multiply something by 0. 01:02:06.230 --> 01:02:08.750 It's OK to multiply something by a small number, 01:02:08.750 --> 01:02:10.640 because then it can still be counterbalanced 01:02:10.640 --> 01:02:14.540 by other larger numbers, but multiplying by 0 means it's the end of the story. 01:02:14.540 --> 01:02:16.520 You multiply a number by 0, and the output's 01:02:16.520 --> 01:02:21.230 going to be 0, no matter how big any of the other numbers happen to be. 
01:02:21.230 --> 01:02:23.810 So one approach that's fairly common in naive Bayes is 01:02:23.810 --> 01:02:29.090 this idea of additive smoothing, adding some value alpha to each of the values 01:02:29.090 --> 01:02:31.943 in our distribution just to smooth the data a little bit. 01:02:31.943 --> 01:02:33.860 One such approach is called Laplace smoothing, 01:02:33.860 --> 01:02:37.530 which basically just means adding one to each value in our distribution. 01:02:37.530 --> 01:02:43.540 So if I have 100 samples and zero of them contain the word grandson, 01:02:43.540 --> 01:02:45.290 well then I might say that, you know what? 01:02:45.290 --> 01:02:49.460 Instead, let's pretend that I've had one additional sample where the word 01:02:49.460 --> 01:02:53.210 grandson appeared and one additional sample where the word grandson didn't 01:02:53.210 --> 01:02:53.840 appear. 01:02:53.840 --> 01:02:57.150 So I'll say, all right, now I have 1 of 102-- 01:02:57.150 --> 01:03:01.550 so one sample that does have the word grandson out of 102 total. 01:03:01.550 --> 01:03:05.070 I'm basically creating two samples that didn't exist before. 01:03:05.070 --> 01:03:08.830 But in doing so, I've been able to smooth the distribution a little bit 01:03:08.830 --> 01:03:12.040 to make sure that I never have to multiply anything by 0. 01:03:12.040 --> 01:03:17.080 By pretending I've seen one more value in each category than I actually have, 01:03:17.080 --> 01:03:19.390 this gets us the result of not having to worry 01:03:19.390 --> 01:03:22.180 about multiplying a number by 0. 01:03:22.180 --> 01:03:24.580 So this then is an approach that we can use in order 01:03:24.580 --> 01:03:27.670 to try and apply naive Bayes, even in situations 01:03:27.670 --> 01:03:31.730 where we're dealing with words that we might not necessarily have seen before. 01:03:31.730 --> 01:03:35.140 And let's now take a look at how we could actually apply that in practice.
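Before turning to the library, the smoothing formula itself can be sketched directly. This is a minimal sketch of additive smoothing for a binary present/absent feature, matching the 100-sample example above:

```python
def smoothed_probability(count, total, alpha=1):
    """Additive smoothing (Laplace smoothing when alpha=1) for a binary
    feature: pretend we saw alpha extra samples that contain the word
    and alpha extra samples that don't."""
    return (count + alpha) / (total + 2 * alpha)

# 0 of 100 positive samples contain "grandson": without smoothing the
# probability would be 0 and would zero out the entire product.
print(smoothed_probability(0, 100))   # 1/102, small but nonzero

# A common word is barely affected: 32 of 100 becomes 33/102.
print(smoothed_probability(32, 100))
```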
01:03:35.140 --> 01:03:38.490 It turns out that NLTK, in addition to having the ability to extract 01:03:38.490 --> 01:03:41.110 n-grams and tokenize things into words, also 01:03:41.110 --> 01:03:45.400 has the ability to be able to apply naive Bayes on some samples of text, 01:03:45.400 --> 01:03:46.920 for example. 01:03:46.920 --> 01:03:48.430 And so let's go ahead and do that. 01:03:48.430 --> 01:03:52.840 What I've done is, inside of sentiment, I've prepared a corpus of just 01:03:52.840 --> 01:03:55.997 some reviews that I've generated, but you can imagine using real reviews. 01:03:55.997 --> 01:03:58.330 I just have a couple of positive reviews-- it was great. 01:03:58.330 --> 01:03:58.873 So much fun. 01:03:58.873 --> 01:03:59.540 Would recommend. 01:03:59.540 --> 01:04:00.550 My grandson loved it. 01:04:00.550 --> 01:04:01.712 Those sorts of messages. 01:04:01.712 --> 01:04:04.420 And then I have a whole bunch of negative reviews-- not worth it, 01:04:04.420 --> 01:04:07.190 kind of cheap, really bad, didn't work the way we expected-- 01:04:07.190 --> 01:04:08.470 just one on each line. 01:04:08.470 --> 01:04:11.860 A whole bunch of positive reviews and negative reviews. 01:04:11.860 --> 01:04:15.130 And what I'd like to do now is analyze them somehow. 01:04:15.130 --> 01:04:19.690 So here then is sentiment.py, and what we're going to do first 01:04:19.690 --> 01:04:23.680 is extract all of the positive and negative sentences, 01:04:23.680 --> 01:04:28.600 create a set of all of the words that were used across all of the messages, 01:04:28.600 --> 01:04:33.340 and then we're going to go ahead and train NLTK's naive Bayes classifier 01:04:33.340 --> 01:04:34.810 on all of this training data.
01:04:34.810 --> 01:04:36.850 And what the training data effectively is, is: I 01:04:36.850 --> 01:04:40.300 take all of the positive messages and give them the label positive, all 01:04:40.300 --> 01:04:42.790 the negative messages and give them the label negative, 01:04:42.790 --> 01:04:45.880 and then I'll go ahead and apply this classifier to it, where I'd say, 01:04:45.880 --> 01:04:48.100 I would like to take all of this training data 01:04:48.100 --> 01:04:52.030 and now have the ability to classify it as positive or negative. 01:04:52.030 --> 01:04:53.860 I'll then take some input from the user. 01:04:53.860 --> 01:04:56.890 They can just type in some sequence of words. 01:04:56.890 --> 01:04:59.020 And then I would like to classify that sequence 01:04:59.020 --> 01:05:01.450 as either positive or negative, and then I'll 01:05:01.450 --> 01:05:04.482 go ahead and print out what the probabilities of each happen to be. 01:05:04.482 --> 01:05:07.690 And there are some helper functions here that just organize things in the way 01:05:07.690 --> 01:05:09.610 that NLTK is expecting them to be. 01:05:09.610 --> 01:05:12.307 But the key idea here is that I'm taking the positive messages, 01:05:12.307 --> 01:05:14.140 labeling them, taking the negative messages, 01:05:14.140 --> 01:05:16.840 labeling them, putting them inside of a classifier, 01:05:16.840 --> 01:05:21.380 and then now trying to classify some new text that comes about. 01:05:21.380 --> 01:05:23.030 So let's go ahead and try it. 01:05:23.030 --> 01:05:26.740 I'll go ahead and go into sentiment, and we'll run python sentiment.py, 01:05:26.740 --> 01:05:29.328 passing in as input that corpus that contains 01:05:29.328 --> 01:05:31.120 all of the positive and negative messages-- 01:05:31.120 --> 01:05:34.480 because depending on the corpus, that's going to affect the probabilities.
01:05:34.480 --> 01:05:36.970 The effectiveness of our ability to classify 01:05:36.970 --> 01:05:41.045 is entirely dependent on how good our data is, and how much data we have, 01:05:41.045 --> 01:05:42.670 and how well they happen to be labeled. 01:05:42.670 --> 01:05:44.640 So now I can try something and say-- 01:05:44.640 --> 01:05:47.170 let's try a review like, this was great-- 01:05:47.170 --> 01:05:49.800 just some review that I might leave. 01:05:49.800 --> 01:05:53.200 And it seems that, all right, there is a 96% chance it estimates 01:05:53.200 --> 01:05:54.930 that this was a positive message-- 01:05:54.930 --> 01:05:58.480 4% chance that it was negative, likely because the word great 01:05:58.480 --> 01:06:00.610 shows up inside of the positive messages, 01:06:00.610 --> 01:06:03.080 but doesn't show up inside of the negative messages. 01:06:03.080 --> 01:06:06.160 And that might be something that our AI is able to capitalize on. 01:06:06.160 --> 01:06:09.640 And really, what it's going to look for are the differentiating words-- 01:06:09.640 --> 01:06:12.490 that if the probability of words like this and was 01:06:12.490 --> 01:06:15.530 is pretty similar between positive and negative messages, 01:06:15.530 --> 01:06:17.680 then the naive Bayes classifier isn't going 01:06:17.680 --> 01:06:21.202 to end up using those values as having some sort of importance 01:06:21.202 --> 01:06:21.910 in the algorithm. 01:06:21.910 --> 01:06:23.710 Because if they're the same on both sides, 01:06:23.710 --> 01:06:26.560 you multiply that value for both positive and negative, 01:06:26.560 --> 01:06:28.270 you end up getting about the same thing.
01:06:28.270 --> 01:06:30.730 What ultimately makes the difference in naive Bayes 01:06:30.730 --> 01:06:34.210 is when you multiply by a value that's much bigger for one category 01:06:34.210 --> 01:06:36.880 than for another category-- when one word like great 01:06:36.880 --> 01:06:39.910 is much more likely to show up in one type of message 01:06:39.910 --> 01:06:41.260 than another type of message. 01:06:41.260 --> 01:06:43.385 And that's one of the nice things about naive Bayes-- 01:06:43.385 --> 01:06:45.250 that, without me telling it that great 01:06:45.250 --> 01:06:48.210 is more important to care about than this or was, 01:06:48.210 --> 01:06:50.380 naive Bayes can figure that out based on the data. 01:06:50.380 --> 01:06:53.740 It can figure out that this shows up about the same amount of time 01:06:53.740 --> 01:06:56.560 between the two, but great, that is a discriminator, 01:06:56.560 --> 01:07:00.060 a word that can be different between the two types of messages. 01:07:00.060 --> 01:07:01.400 So I could try it again-- 01:07:01.400 --> 01:07:04.583 type in a sentence like, lots of fun, for example. 01:07:04.583 --> 01:07:06.250 This one it's a little less sure about-- 01:07:06.250 --> 01:07:10.690 62% chance that it's positive, 37% chance that it's negative-- maybe 01:07:10.690 --> 01:07:12.720 because there aren't as clear discriminators 01:07:12.720 --> 01:07:15.310 or differentiators inside of this data. 01:07:15.310 --> 01:07:16.400 I'll try one more-- 01:07:16.400 --> 01:07:20.430 say kind of overpriced. 01:07:20.430 --> 01:07:23.633 And all right, now 95%, 96% sure that this 01:07:23.633 --> 01:07:25.800 is a negative sentiment-- likely because of the word 01:07:25.800 --> 01:07:29.032 overpriced, because it's shown up in a negative sentiment expression 01:07:29.032 --> 01:07:31.740 before, and therefore, it thinks, you know what, this is probably 01:07:31.740 --> 01:07:34.720 going to be a negative sentence.
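The behavior just described can be sketched from scratch, without NLTK: a tiny naive Bayes classifier with Laplace smoothing, trained on a handful of reviews like the lecture's. The reviews below (and therefore the exact probabilities) are invented for illustration, not the actual corpus; logs are used to avoid multiplying many small numbers.

```python
import math
from collections import Counter

def train(positives, negatives):
    """Count, for each word, how many messages of each label contain it."""
    pos_counts = Counter(w for msg in positives for w in set(msg.lower().split()))
    neg_counts = Counter(w for msg in negatives for w in set(msg.lower().split()))
    return pos_counts, len(positives), neg_counts, len(negatives)

def classify(text, model, alpha=1):
    pos_counts, n_pos, neg_counts, n_neg = model
    # Work in log space to avoid underflow from long products of small values.
    log_pos = math.log(n_pos / (n_pos + n_neg))
    log_neg = math.log(n_neg / (n_pos + n_neg))
    for word in set(text.lower().split()):
        # Laplace-smoothed P(word present | label)
        log_pos += math.log((pos_counts[word] + alpha) / (n_pos + 2 * alpha))
        log_neg += math.log((neg_counts[word] + alpha) / (n_neg + 2 * alpha))
    # Normalize back into a probability distribution over the two labels.
    p_pos = 1 / (1 + math.exp(log_neg - log_pos))
    return {"positive": p_pos, "negative": 1 - p_pos}

positives = ["it was great", "so much fun", "would recommend", "my grandson loved it"]
negatives = ["not worth it", "kind of cheap", "really bad", "didn't work the way we expected"]
model = train(positives, negatives)

result = classify("this was great", model)
# "great" appears only in positive reviews, so positive should win,
# while "this" (seen in neither set) contributes equally to both labels.
print(result)
```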
01:07:34.720 --> 01:07:37.830 And so naive Bayes has now given us the ability to classify text. 01:07:37.830 --> 01:07:40.350 Given enough training data, given enough examples, 01:07:40.350 --> 01:07:44.400 we can train our AI to be able to look at natural language, human words, 01:07:44.400 --> 01:07:46.410 figure out which words are likely to show up 01:07:46.410 --> 01:07:48.870 in positive as opposed to negative sentiment messages, 01:07:48.870 --> 01:07:50.670 and categorize them accordingly. 01:07:50.670 --> 01:07:52.420 And you could imagine doing the same thing 01:07:52.420 --> 01:07:55.170 anytime you want to take text and group it into categories. 01:07:55.170 --> 01:07:58.300 If I want to take an email and categorize it 01:07:58.300 --> 01:08:01.560 as a good email or as a spam email, you could apply a similar idea. 01:08:01.560 --> 01:08:04.020 Try and look for the discriminating words, 01:08:04.020 --> 01:08:07.230 the words that make it more likely to be a spam email or not, 01:08:07.230 --> 01:08:10.830 and just train a naive Bayes classifier to be able to figure out 01:08:10.830 --> 01:08:14.250 what that distribution is and to be able to figure out how to categorize 01:08:14.250 --> 01:08:15.978 an email as good or as spam. 01:08:15.978 --> 01:08:19.020 Now, of course, it's not going to be able to give us a definitive answer. 01:08:19.020 --> 01:08:22.950 It gives us a probability distribution, something like 63% 01:08:22.950 --> 01:08:25.380 positive, 37% negative. 01:08:25.380 --> 01:08:29.550 And that might be why the spam filters in our email sometimes make mistakes, 01:08:29.550 --> 01:08:32.700 sometimes think that a good email is actually spam or vice 01:08:32.700 --> 01:08:36.000 versa, because ultimately, the best that it can do 01:08:36.000 --> 01:08:37.890 is calculate a probability distribution.
01:08:37.890 --> 01:08:40.290 If natural language is ambiguous, we can usually 01:08:40.290 --> 01:08:42.960 just deal in the world of probabilities to try and get 01:08:42.960 --> 01:08:47.100 an answer that is reasonably good, even if we aren't able to guarantee for sure 01:08:47.100 --> 01:08:50.970 that it is the answer that we actually expect it to be. 01:08:50.970 --> 01:08:54.600 That then was a look at how we can begin to take some text 01:08:54.600 --> 01:08:59.910 and to be able to analyze the text and group it into some sorts of categories. 01:08:59.910 --> 01:09:04.140 But ultimately, in addition to just being able to analyze text and categorize it, 01:09:04.140 --> 01:09:08.130 we'd like to be able to figure out information about the text, 01:09:08.130 --> 01:09:11.130 get at some sort of meaning out of the text as well. 01:09:11.130 --> 01:09:13.500 And this starts to get us into the world of information, 01:09:13.500 --> 01:09:16.620 of being able to try and take data in the form of text 01:09:16.620 --> 01:09:18.450 and retrieve information from it. 01:09:18.450 --> 01:09:22.500 So one type of problem is known as information retrieval, or IR, 01:09:22.500 --> 01:09:26.979 which is the task of finding relevant documents in response to a query. 01:09:26.979 --> 01:09:30.330 So this is something like you type in a query into a search engine, 01:09:30.330 --> 01:09:32.279 like Google, or you're typing something 01:09:32.279 --> 01:09:35.640 into some system-- inside of a library catalog, 01:09:35.640 --> 01:09:38.609 for example-- that's going to look for responses to a query. 01:09:38.609 --> 01:09:43.217 I want to look for documents that are about the US constitution or something, 01:09:43.217 --> 01:09:45.300 and I would like to get a whole bunch of documents 01:09:45.300 --> 01:09:47.819 that match that query back to me.
01:09:47.819 --> 01:09:50.819 But you might imagine that what I really want to be able to do 01:09:50.819 --> 01:09:53.160 is, in order to solve this task effectively, 01:09:53.160 --> 01:09:55.830 I need to be able to take documents and figure out, 01:09:55.830 --> 01:09:57.870 what are those documents about? 01:09:57.870 --> 01:10:01.680 I want to be able to say what is it that these particular documents are 01:10:01.680 --> 01:10:03.900 about-- what are the topics of those documents-- 01:10:03.900 --> 01:10:08.160 so that I can then more effectively be able to retrieve information 01:10:08.160 --> 01:10:10.050 from those particular documents. 01:10:10.050 --> 01:10:13.560 And this refers to a set of tasks generally known as topic modeling, 01:10:13.560 --> 01:10:17.918 where I'd like to discover what the topics are for a set of documents. 01:10:17.918 --> 01:10:19.710 And this is something that humans could do. 01:10:19.710 --> 01:10:21.800 A human could read a document and tell you, all right, 01:10:21.800 --> 01:10:23.883 here's what this document is about, and give maybe 01:10:23.883 --> 01:10:27.862 a couple of topics, or who the important people in this document are, what 01:10:27.862 --> 01:10:30.570 the important objects in the document are-- a human can probably tell you 01:10:30.570 --> 01:10:32.370 that kind of thing. 01:10:32.370 --> 01:10:35.160 But we'd like for our AI to be able to do the same thing. 01:10:35.160 --> 01:10:38.760 Given some document, can you tell me what the important words 01:10:38.760 --> 01:10:39.870 in this document are? 01:10:39.870 --> 01:10:42.095 What are the words that set this document apart 01:10:42.095 --> 01:10:44.220 that I might care about if I'm looking at documents 01:10:44.220 --> 01:10:47.128 based on keywords, for example? 01:10:47.128 --> 01:10:49.920 And so one instinctive idea-- an intuitive idea that probably makes 01:10:49.920 --> 01:10:50.580 sense-- 01:10:50.580 --> 01:10:53.250 is let's just use term frequency.
01:10:53.250 --> 01:10:56.100 Term frequency is just defined as the number of times 01:10:56.100 --> 01:10:58.650 a particular term appears in a document. 01:10:58.650 --> 01:11:03.300 If I have a document with 100 words and one particular word shows up 10 times, 01:11:03.300 --> 01:11:05.440 it has a term frequency of 10. 01:11:05.440 --> 01:11:06.690 It shows up pretty often. 01:11:06.690 --> 01:11:09.000 Maybe that's going to be an important word. 01:11:09.000 --> 01:11:10.750 And sometimes, you'll also see this framed 01:11:10.750 --> 01:11:14.620 as a proportion of the total number of words, so 10 words out of 100. 01:11:14.620 --> 01:11:19.110 Maybe it has a term frequency of 0.1, meaning 10% of all of the words 01:11:19.110 --> 01:11:21.530 are this particular word that I care about. 01:11:21.530 --> 01:11:23.280 Ultimately, that doesn't change how relatively 01:11:23.280 --> 01:11:26.300 important the words are for any one particular document-- 01:11:26.300 --> 01:11:27.730 it's the same idea. 01:11:27.730 --> 01:11:31.050 The idea is look for words that show up more frequently, because those 01:11:31.050 --> 01:11:35.970 are more likely to be the important words inside of a corpus of documents. 01:11:35.970 --> 01:11:37.840 And so let's go ahead and give that a try. 01:11:37.840 --> 01:11:40.980 Let's say I wanted to find out what the Sherlock Holmes stories are about. 01:11:40.980 --> 01:11:42.780 I have a whole bunch of Sherlock Holmes stories 01:11:42.780 --> 01:11:45.000 and I want to know, in general, what are they about? 01:11:45.000 --> 01:11:47.708 What are the important characters? 01:11:47.708 --> 01:11:49.000 What are the important objects? 01:11:49.000 --> 01:11:52.170 What are the important parts of the story, just in terms of words? 
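Term frequency itself takes only a few lines to compute. A quick sketch, where the sample sentence is invented for illustration:

```python
from collections import Counter

document = "the dog chased the cat and the cat ran"
words = document.split()

# Raw term frequency: how many times each term appears in the document.
term_frequencies = Counter(words)
print(term_frequencies["the"])  # 3

# The same idea framed as a proportion of the total number of words.
print(term_frequencies["the"] / len(words))  # 3 out of 9 words
```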
01:11:52.170 --> 01:11:55.350 And I'd like for the AI to be able to figure that out on its own, 01:11:55.350 --> 01:11:57.660 and we'll do so by looking at term frequency-- 01:11:57.660 --> 01:12:01.930 by looking at, what are the words that show up the most often? 01:12:01.930 --> 01:12:06.250 So we'll go ahead, and I'll go ahead and go into the tfidf directory. 01:12:06.250 --> 01:12:08.350 You'll see why it's called that in a moment. 01:12:08.350 --> 01:12:14.290 But let's first open up tf0.py, which is going to calculate the top 10 term 01:12:14.290 --> 01:12:17.092 frequencies-- or maybe top five term frequencies 01:12:17.092 --> 01:12:19.300 for a corpus of documents, a whole bunch of documents 01:12:19.300 --> 01:12:22.930 where each document is just a story from Sherlock Holmes. 01:12:22.930 --> 01:12:26.772 We're going to load all the data into our corpus 01:12:26.772 --> 01:12:29.850 and we're going to figure out, what are all of the words that 01:12:29.850 --> 01:12:32.610 show up inside of that corpus? 01:12:32.610 --> 01:12:35.187 And we're going to basically just assemble all 01:12:35.187 --> 01:12:36.770 of the term frequencies. 01:12:36.770 --> 01:12:39.510 We're going to calculate how often each of these terms 01:12:39.510 --> 01:12:41.880 appears inside of the document. 01:12:41.880 --> 01:12:43.368 And we'll print out the top five. 01:12:43.368 --> 01:12:45.660 And so there are some data structures involved that you 01:12:45.660 --> 01:12:47.160 can take a look at if you'd like to. 01:12:47.160 --> 01:12:50.550 The exact code is not so important; what matters is the idea of what we're doing. 01:12:50.550 --> 01:12:54.450 We're taking each of these documents and first sorting them. 01:12:54.450 --> 01:12:56.340 We're saying, take all the words that show up 01:12:56.340 --> 01:13:00.080 and sort them by how often each word shows up.
01:13:00.080 --> 01:13:04.710 And let's go ahead and just, for each document, save the top five 01:13:04.710 --> 01:13:07.720 terms that happen to show up in each of those documents. 01:13:07.720 --> 01:13:10.900 So again, some helper functions you can take a look at if you're interested. 01:13:10.900 --> 01:13:13.440 But the key idea here is that all we're going to do 01:13:13.440 --> 01:13:18.240 is run tf0 on the Sherlock Holmes stories. 01:13:18.240 --> 01:13:21.840 And what I'm hoping to get out of this process is I am hoping to figure out, 01:13:21.840 --> 01:13:25.150 what are the important words in Sherlock Holmes, for example? 01:13:25.150 --> 01:13:29.370 So we'll go ahead and run this and see what we get. 01:13:29.370 --> 01:13:30.982 And it's loading the data. 01:13:30.982 --> 01:13:31.940 And here's what we get. 01:13:31.940 --> 01:13:36.530 For this particular story, the important words are the, and and, and I, 01:13:36.530 --> 01:13:37.368 and to, and of. 01:13:37.368 --> 01:13:39.410 Those are the words that show up more frequently. 01:13:39.410 --> 01:13:45.000 In this particular story, it's the, and and, and I, and a, and of. 01:13:45.000 --> 01:13:47.000 This is not particularly useful to us. 01:13:47.000 --> 01:13:48.230 We're using term frequencies. 01:13:48.230 --> 01:13:50.930 We're looking at what words show up the most frequently in each 01:13:50.930 --> 01:13:54.830 of these various different documents, but what we get naturally 01:13:54.830 --> 01:13:57.470 are just the words that show up a lot in English. 01:13:57.470 --> 01:14:00.385 The words the, and of, and and happen to show up a lot in English, 01:14:00.385 --> 01:14:02.510 and therefore, they happen to show up a lot in each 01:14:02.510 --> 01:14:04.052 of these various different documents.
01:14:04.052 --> 01:14:06.320 This is not a particularly useful metric for us 01:14:06.320 --> 01:14:08.690 to be able to analyze what words are important, 01:14:08.690 --> 01:14:12.960 because these words are just part of the grammatical structure of English. 01:14:12.960 --> 01:14:17.610 And it turns out we can categorize words into a couple of different categories. 01:14:17.610 --> 01:14:21.102 These words happen to be known as what we might call function words, words 01:14:21.102 --> 01:14:23.060 that have little meaning on their own, but that 01:14:23.060 --> 01:14:26.100 are used to grammatically connect different parts of a sentence. 01:14:26.100 --> 01:14:29.120 These are words like am, and by, and do, and is, and which, 01:14:29.120 --> 01:14:32.130 and with, and yet-- words that, on their own, what do they mean? 01:14:32.130 --> 01:14:33.140 It's hard to say. 01:14:33.140 --> 01:14:35.390 They get their meaning from how they connect 01:14:35.390 --> 01:14:36.980 different parts of the sentence. 01:14:36.980 --> 01:14:40.610 And these function words are what we might call a closed class of words 01:14:40.610 --> 01:14:41.990 in a language like English. 01:14:41.990 --> 01:14:44.690 There's really just some fixed list of function words, 01:14:44.690 --> 01:14:46.190 and they don't change very often. 01:14:46.190 --> 01:14:48.260 There's just some list of words that are commonly 01:14:48.260 --> 01:14:52.460 used to connect other grammatical structures in the language. 01:14:52.460 --> 01:14:56.120 And that's in contrast with what we might call content words, words 01:14:56.120 --> 01:14:58.970 that carry meaning independently-- words like algorithm, 01:14:58.970 --> 01:15:02.580 category, computer, words that actually have some sort of meaning. 01:15:02.580 --> 01:15:05.150 And these are usually the words that we care about. 
01:15:05.150 --> 01:15:07.250 These are the words where we want to figure out, 01:15:07.250 --> 01:15:10.020 what are the important words in our document? 01:15:10.020 --> 01:15:12.230 We probably care about the content words more 01:15:12.230 --> 01:15:15.380 than we care about the function words. 01:15:15.380 --> 01:15:20.770 And so one strategy we could apply is just ignore all of the function words. 01:15:20.770 --> 01:15:26.120 So here in tf1.py, I've done the same exact thing, 01:15:26.120 --> 01:15:31.790 except I'm going to load a whole bunch of words from a function_words.txt 01:15:31.790 --> 01:15:35.670 file, inside of which are just a whole bunch of function words in alphabetical 01:15:35.670 --> 01:15:36.170 order. 01:15:36.170 --> 01:15:38.570 These are just a whole bunch of function words 01:15:38.570 --> 01:15:41.870 that are just words that are used to connect other words in English, 01:15:41.870 --> 01:15:44.275 and someone has just compiled this particular list. 01:15:44.275 --> 01:15:46.400 And these are the words that I just want to ignore. 01:15:46.400 --> 01:15:49.790 If it's any of these words, let's just ignore it as one of the top terms, 01:15:49.790 --> 01:15:52.790 because these are not words that I probably care about 01:15:52.790 --> 01:15:56.570 if I want to analyze what the important terms inside of a document 01:15:56.570 --> 01:15:57.860 happen to be. 01:15:57.860 --> 01:16:01.820 So in tf1, what we're ultimately doing is, 01:16:01.820 --> 01:16:05.360 if the word is in my set of function words, 01:16:05.360 --> 01:16:08.720 I'm just going to skip over it, just ignore any of the function words 01:16:08.720 --> 01:16:11.210 by continuing on to the next word and then 01:16:11.210 --> 01:16:14.010 just calculating the frequencies for those words instead.
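The skip-over-function-words step can be sketched like this. The tiny function-word set below stands in for the lecture's much longer function_words.txt, and the sample sentence is invented:

```python
from collections import Counter

# A tiny stand-in for the function_words.txt file described above.
function_words = {"the", "a", "an", "and", "of", "to", "i", "it", "was"}

document = "the adventure of the speckled band was a mystery"

# Count term frequencies, skipping any word in the function-word set.
counts = Counter(
    word for word in document.lower().split()
    if word not in function_words
)
print(counts.most_common(2))  # only content words survive the filter
```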
01:16:14.010 --> 01:16:16.520 So I'm going to pretend the function words aren't there, 01:16:16.520 --> 01:16:19.550 and now maybe I can get a better sense for what 01:16:19.550 --> 01:16:23.060 terms are important in each of the various different Sherlock Holmes 01:16:23.060 --> 01:16:24.560 stories. 01:16:24.560 --> 01:16:29.080 So now let's run tf1 on the Sherlock Holmes corpus and see what we get now. 01:16:29.080 --> 01:16:32.510 And let's look at, what is the most important term in each of the stories? 01:16:32.510 --> 01:16:34.760 Well, it seems like, for each of the stories, 01:16:34.760 --> 01:16:36.770 the most important word is Holmes. 01:16:36.770 --> 01:16:38.270 I guess that's what we would expect. 01:16:38.270 --> 01:16:39.380 They're all Sherlock Holmes stories. 01:16:39.380 --> 01:16:40.922 And Holmes is not a function word. 01:16:40.922 --> 01:16:44.360 It's not the, or a, or an, so it wasn't ignored. 01:16:44.360 --> 01:16:46.130 But Holmes and man-- 01:16:46.130 --> 01:16:50.760 these are probably not what I mean when I say, what are the important words? 01:16:50.760 --> 01:16:52.700 Even though Holmes does show up the most often, 01:16:52.700 --> 01:16:54.890 it's not giving me a whole lot of information here 01:16:54.890 --> 01:16:57.800 about what each of the different Sherlock Holmes stories 01:16:57.800 --> 01:16:59.460 are actually about. 01:16:59.460 --> 01:17:02.880 And the reason why is because Sherlock Holmes shows up in all the stories, 01:17:02.880 --> 01:17:06.950 and so it's not meaningful for me to say that this story is about Sherlock 01:17:06.950 --> 01:17:09.560 Holmes if I want to try and figure out the different topics 01:17:09.560 --> 01:17:11.180 across the corpus of documents. 01:17:11.180 --> 01:17:13.640 What I really want to know is, what words show up 01:17:13.640 --> 01:17:18.170 in this document that show up less frequently in the other documents, 01:17:18.170 --> 01:17:19.380 for example?
01:17:19.380 --> 01:17:22.730 And so to get at that idea, we're going to introduce the notion 01:17:22.730 --> 01:17:25.850 of inverse document frequency. 01:17:25.850 --> 01:17:29.450 Inverse document frequency is a measure of how common, 01:17:29.450 --> 01:17:33.530 or rare, a word happens to be across an entire corpus of documents. 01:17:33.530 --> 01:17:35.960 And mathematically, it's usually calculated like this-- 01:17:35.960 --> 01:17:39.440 as the logarithm of the total number of documents 01:17:39.440 --> 01:17:43.550 divided by the number of documents containing the word. 01:17:43.550 --> 01:17:47.510 So if a word like Holmes shows up in all of the documents, 01:17:47.510 --> 01:17:50.870 well, then the total number of documents 01:17:50.870 --> 01:17:55.110 and the number of documents containing Holmes are going to be the same number. 01:17:55.110 --> 01:17:58.760 So when you divide these two together, you'll get 1, and the logarithm of 1 01:17:58.760 --> 01:18:00.460 is just 0. 01:18:00.460 --> 01:18:04.370 And so what we get is, if Holmes shows up in all of the documents, 01:18:04.370 --> 01:18:07.040 it has an inverse document frequency of 0. 01:18:07.040 --> 01:18:09.560 And you can think now of inverse document frequency 01:18:09.560 --> 01:18:13.370 as a measure of how rare the word that 01:18:13.370 --> 01:18:16.280 shows up in this particular document is-- that if a word doesn't show up 01:18:16.280 --> 01:18:21.060 across many documents at all, this number is going to be much higher. 01:18:21.060 --> 01:18:24.710 And this then gets us to a model known as tf-idf, 01:18:24.710 --> 01:18:28.310 which is a method for ranking what words are important in the document 01:18:28.310 --> 01:18:30.440 by multiplying these two ideas together. 
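The idf formula just described translates directly into code. Here is a minimal sketch (the toy documents are invented for illustration):

```python
import math

def inverse_document_frequency(word, documents):
    """idf = log(total number of documents / number of documents containing the word)."""
    containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / containing)

# Each document represented as a set of its words
documents = [
    {"holmes", "man", "elections"},
    {"holmes", "man", "senate"},
    {"holmes", "watson", "impeachment"},
]
print(inverse_document_frequency("holmes", documents))  # 0.0 -- shows up everywhere
print(inverse_document_frequency("senate", documents))  # log(3), much higher -- rare word
```

As the lecture notes, "holmes" appears in every document, so the ratio is 1 and the logarithm is 0; a word appearing in only one of three documents scores log(3).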
01:18:30.440 --> 01:18:37.190 Multiply term frequency, or TF, by inverse document frequency, or IDF, 01:18:37.190 --> 01:18:39.890 where the idea here now is that how important a word is 01:18:39.890 --> 01:18:41.540 depends on two things. 01:18:41.540 --> 01:18:44.197 It depends on how often it shows up in the document, using 01:18:44.197 --> 01:18:46.280 the heuristic that, if a word shows up more often, 01:18:46.280 --> 01:18:47.900 it's probably more important. 01:18:47.900 --> 01:18:51.170 And we multiply that by inverse document frequency, IDF, 01:18:51.170 --> 01:18:54.900 because if the word is rarer, but it shows up in the document, 01:18:54.900 --> 01:18:57.200 it's probably more important than if the word shows up 01:18:57.200 --> 01:19:00.200 across most or all of the documents, because then it's probably 01:19:00.200 --> 01:19:02.990 a less important factor in what the different topics 01:19:02.990 --> 01:19:06.840 across the different documents in the corpus happen to be. 01:19:06.840 --> 01:19:11.060 And so now let's go ahead and apply this algorithm on the Sherlock Holmes 01:19:11.060 --> 01:19:13.340 corpus. 01:19:13.340 --> 01:19:15.650 And here's tfidf. 01:19:15.650 --> 01:19:18.860 Now what I'm doing is, for each of the documents, 01:19:18.860 --> 01:19:22.120 for each word, I'm calculating its TF score, 01:19:22.120 --> 01:19:25.160 term frequency, multiplied by the inverse document 01:19:25.160 --> 01:19:28.190 frequency of that word-- not just looking at the single value, 01:19:28.190 --> 01:19:30.410 but multiplying these two values together 01:19:30.410 --> 01:19:33.650 in order to compute the overall values. 01:19:33.650 --> 01:19:37.610 And now, if I run tfidf on the Holmes corpus, 01:19:37.610 --> 01:19:40.615 this is going to try and get us a better approximation for what's 01:19:40.615 --> 01:19:41.990 important in each of the stories. 
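Putting the two ideas together, a bare-bones version of the tf-idf computation the lecture's tfidf program performs might look like this (the toy corpus is made up; the real program reads the Sherlock Holmes files from disk):

```python
import math
from collections import Counter

def tfidf(documents):
    """For each document (a list of words), score each word by tf * idf."""
    doc_count = Counter()  # number of documents containing each word
    for doc in documents:
        doc_count.update(set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)
        scores.append({
            word: tf[word] * math.log(len(documents) / doc_count[word])
            for word in tf
        })
    return scores

docs = [
    ["holmes", "holmes", "elections", "elections", "elections"],
    ["holmes", "senate", "impeachment"],
]
scores = tfidf(docs)
# "holmes" appears in every document, so its tf-idf score is 0 in both
print(max(scores[0], key=scores[0].get))  # elections
```

Even though "holmes" has the highest term frequency in the first document, its idf of 0 zeroes it out, and the document-specific word wins.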
01:19:41.990 --> 01:19:44.000 And it seems like it's trying to extract here 01:19:44.000 --> 01:19:46.280 probably like the names of characters that 01:19:46.280 --> 01:19:49.010 happen to be important in the story-- characters that show up 01:19:49.010 --> 01:19:51.380 in this story that don't show up in the other stories-- 01:19:51.380 --> 01:19:53.930 and prioritizing the more important characters that 01:19:53.930 --> 01:19:56.510 happen to show up more often. 01:19:56.510 --> 01:20:00.170 And so this then might be a better analysis of what types of topics 01:20:00.170 --> 01:20:02.070 are more or less important. 01:20:02.070 --> 01:20:05.330 I also have another corpus, which is a corpus of all of the Federalist 01:20:05.330 --> 01:20:07.700 Papers from American history. 01:20:07.700 --> 01:20:11.240 If I go ahead and run tfidf on the Federalist Papers, 01:20:11.240 --> 01:20:14.330 we can begin to see what the important words in each 01:20:14.330 --> 01:20:16.910 of the various different Federalist Papers happen to be-- 01:20:16.910 --> 01:20:22.070 that in Federalist Paper Number 61, it seems like it's a lot about elections. 01:20:22.070 --> 01:20:25.350 In Federalist Paper Number 66, it's about the Senate and impeachments. 01:20:25.350 --> 01:20:28.470 You can start to extract what the important terms and what 01:20:28.470 --> 01:20:32.540 the important words are just by looking at what things 01:20:32.540 --> 01:20:34.800 don't show up across many of the documents, 01:20:34.800 --> 01:20:38.637 but show up frequently enough in certain of the documents. 01:20:38.637 --> 01:20:40.470 And so this can be a helpful tool for trying 01:20:40.470 --> 01:20:43.350 to figure out this kind of topic modeling, 01:20:43.350 --> 01:20:47.100 figuring out what it is that a particular document happens 01:20:47.100 --> 01:20:48.620 to be about. 
01:20:48.620 --> 01:20:53.070 And so this then is starting to get us into this world of semantics, 01:20:53.070 --> 01:20:56.880 what it is that things actually mean when we're talking about language. 01:20:56.880 --> 01:20:59.100 Now, we're no longer just going to think about the bag of words, 01:20:59.100 --> 01:21:02.670 where we treat a sample of text as just a whole bunch of words. 01:21:02.670 --> 01:21:04.320 And we don't care about the order. 01:21:04.320 --> 01:21:06.870 Now, when we get into the world of semantics, 01:21:06.870 --> 01:21:10.750 we really do start to care about what it is that these words actually mean, 01:21:10.750 --> 01:21:12.850 how it is these words relate to each other, 01:21:12.850 --> 01:21:17.250 and in particular, how we can extract information out of that text. 01:21:17.250 --> 01:21:20.970 Information extraction is the task of extracting knowledge 01:21:20.970 --> 01:21:23.970 from our documents-- figuring out, given a whole bunch of text, 01:21:23.970 --> 01:21:28.140 can we automate the process of having an AI look at those documents 01:21:28.140 --> 01:21:31.710 and get out what the useful or relevant knowledge inside those documents 01:21:31.710 --> 01:21:33.190 happens to be? 01:21:33.190 --> 01:21:34.950 So let's take a look at an example. 01:21:34.950 --> 01:21:37.415 I'll give you two samples from news articles. 01:21:37.415 --> 01:21:40.290 Here up above is a sample of a news article from the Harvard Business 01:21:40.290 --> 01:21:42.310 Review that was about Facebook. 01:21:42.310 --> 01:21:45.630 Down below is an example of a Business Insider article from 2018 01:21:45.630 --> 01:21:47.550 that was about Amazon. 01:21:47.550 --> 01:21:49.710 And there's some information here that we might 01:21:49.710 --> 01:21:51.570 want an AI to be able to extract-- 01:21:51.570 --> 01:21:54.030 information, knowledge about these companies 01:21:54.030 --> 01:21:55.670 that we might want to extract. 
01:21:55.670 --> 01:21:58.020 And in particular, what I might want to extract is-- 01:21:58.020 --> 01:22:02.260 let's say I want to know data about when companies were founded-- 01:22:02.260 --> 01:22:05.250 that I wanted to know that Facebook was founded in 2004, 01:22:05.250 --> 01:22:07.190 Amazon founded in 1994-- 01:22:07.190 --> 01:22:10.500 that that is important information that I happen to care about. 01:22:10.500 --> 01:22:13.110 Well, how do we extract that information from the text? 01:22:13.110 --> 01:22:15.660 What is my way of being able to understand this text 01:22:15.660 --> 01:22:18.810 and figure out, all right, Facebook was founded in 2004? 01:22:18.810 --> 01:22:22.710 Well, what I can look for are templates or patterns, things 01:22:22.710 --> 01:22:26.700 that happen to show up across multiple different documents that give me 01:22:26.700 --> 01:22:28.922 some sense for what this knowledge happens to mean. 01:22:28.922 --> 01:22:30.630 And what we'll notice is a common pattern 01:22:30.630 --> 01:22:34.500 between both of these passages, which is this phrasing here. 01:22:34.500 --> 01:22:37.890 When Facebook was founded in 2004, comma-- 01:22:37.890 --> 01:22:42.360 and then down below, when Amazon was founded in 1994, comma. 
01:22:42.360 --> 01:22:47.640 And those two templates end up giving us a mechanism for trying to extract 01:22:47.640 --> 01:22:53.220 information-- that this notion, when company was founded in year comma, 01:22:53.220 --> 01:22:56.310 this can tell us something about when a company was founded, 01:22:56.310 --> 01:22:58.820 because if we set our AI loose on the web, 01:22:58.820 --> 01:23:01.530 let it look at a whole bunch of papers or a whole bunch of articles, 01:23:01.530 --> 01:23:03.360 and it finds this pattern-- 01:23:03.360 --> 01:23:06.930 when blank was founded in blank, comma-- 01:23:06.930 --> 01:23:09.840 well, then our AI can pretty reasonably conclude 01:23:09.840 --> 01:23:13.740 that there's a good chance that this is going to be like some company, 01:23:13.740 --> 01:23:17.470 and this is going to be like the year that company was founded, for example-- 01:23:17.470 --> 01:23:20.907 might not be perfect, but at least it's a good heuristic. 01:23:20.907 --> 01:23:22.740 And so you might imagine that, if you wanted 01:23:22.740 --> 01:23:25.650 to train an AI to be able to look for information, 01:23:25.650 --> 01:23:27.810 you might give the AI templates like this-- 01:23:27.810 --> 01:23:31.200 not only give it a template like when company blank was founded in blank, 01:23:31.200 --> 01:23:34.710 but give it like, the book blank was written by blank, for example. 01:23:34.710 --> 01:23:37.500 Just give it some templates where it can search the web, 01:23:37.500 --> 01:23:41.640 search a whole big corpus of documents, looking for templates that match that, 01:23:41.640 --> 01:23:44.970 and if it finds that, then it's able to figure out, 01:23:44.970 --> 01:23:47.370 all right, here's the company and here's the year. 01:23:47.370 --> 01:23:50.250 But of course, that requires us to write these templates. 
01:23:50.250 --> 01:23:53.547 It requires us to figure out, what is the structure of this information 01:23:53.547 --> 01:23:54.630 likely going to look like? 01:23:54.630 --> 01:23:56.190 And it might be difficult to know. 01:23:56.190 --> 01:23:58.500 Different websites are, of course, going to do this differently. 01:23:58.500 --> 01:24:01.830 This type of method isn't going to be able to extract all of the information, 01:24:01.830 --> 01:24:04.170 because if the words are slightly in a different order, 01:24:04.170 --> 01:24:06.840 it won't match on that particular template. 01:24:06.840 --> 01:24:11.310 But one thing we can do is, rather than give our AI the template, 01:24:11.310 --> 01:24:13.290 we can give the AI the data. 01:24:13.290 --> 01:24:19.540 We can tell the AI, Facebook was founded in 2004 and Amazon was founded in 1994, 01:24:19.540 --> 01:24:22.440 and just tell the AI those two pieces of information, 01:24:22.440 --> 01:24:24.780 and then set the AI loose on the web. 01:24:24.780 --> 01:24:30.030 And now the idea is that the AI can begin to look for, where do Facebook and 2004 01:24:30.030 --> 01:24:33.150 show up together, where do Amazon and 1994 show up together, 01:24:33.150 --> 01:24:36.150 and it can discover these templates for itself. 01:24:36.150 --> 01:24:38.580 It can discover that this kind of phrasing-- 01:24:38.580 --> 01:24:40.320 when blank was founded in blank-- 01:24:40.320 --> 01:24:45.030 tends to relate Facebook to 2004, and it relates Amazon to 1994, 01:24:45.030 --> 01:24:49.320 so maybe the same relation will hold for others as well. 01:24:49.320 --> 01:24:51.572 And this ends up being-- this automated template 01:24:51.572 --> 01:24:54.030 generation ends up being quite powerful, and we'll go ahead 01:24:54.030 --> 01:24:56.250 and take a look at that now as well. 
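The details of the course's search.py aren't shown in the transcript, but the core idea, discovering templates from seed pairs and then reusing them to extract new pairs, can be sketched like this (the corpus sentences, the 30-character window, and the regex approach are all illustrative assumptions, not the real program):

```python
import re

def find_templates(seed_pairs, corpus):
    """Find short text templates that connect each seed (entity, value) pair."""
    templates = set()
    for entity, value in seed_pairs:
        for document in corpus:
            # Capture up to 30 characters sandwiched between the pair
            for match in re.finditer(
                re.escape(entity) + r"(.{1,30}?)" + re.escape(value), document
            ):
                templates.add(match.group(1))
    return templates

def apply_templates(templates, corpus):
    """Use the discovered templates to extract new (entity, value) pairs."""
    results = set()
    for template in templates:
        pattern = r"(\w+)" + re.escape(template) + r"(\w+)"
        for document in corpus:
            results.update(re.findall(pattern, document))
    return results

corpus = [
    "When Facebook was founded in 2004, the world changed.",
    "Back when Amazon was founded in 1994, e-commerce was new.",
    "When Walmart was founded in 1962, retail looked different.",
]
seeds = [("Facebook", "2004"), ("Amazon", "1994")]
templates = find_templates(seeds, corpus)  # discovers " was founded in "
print(apply_templates(templates, corpus))  # includes ('Walmart', '1962')
```

Given only the two seed pairs, the discovered template " was founded in " matches the Walmart sentence too, yielding a new pair the AI was never told about.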
01:24:56.250 --> 01:24:59.040 What I have here inside of the templates directory 01:24:59.040 --> 01:25:03.120 is a file called companies.csv, and this is all of the data 01:25:03.120 --> 01:25:04.520 that I am going to give to my AI. 01:25:04.520 --> 01:25:09.000 I'm going to give it the pair Amazon, 1994 and Facebook, 2004. 01:25:09.000 --> 01:25:11.190 And what I'm going to tell my AI to do is 01:25:11.190 --> 01:25:14.010 search a corpus of documents for other data-- 01:25:14.010 --> 01:25:16.620 other pairs like this, other relationships. 01:25:16.620 --> 01:25:18.990 I'm not telling the AI that this is a company and the date 01:25:18.990 --> 01:25:19.920 that it was founded. 01:25:19.920 --> 01:25:23.750 I'm just giving it Amazon, 1994 and Facebook, 2004 01:25:23.750 --> 01:25:25.550 and letting the AI do the rest. 01:25:25.550 --> 01:25:28.640 And what the AI is going to do is it's going to look through my corpus-- 01:25:28.640 --> 01:25:30.770 here's my corpus of documents-- 01:25:30.770 --> 01:25:33.590 and it's going to find, like inside of Business Insider, 01:25:33.590 --> 01:25:38.580 that we have sentences like, back when Amazon was founded in 1994, comma-- 01:25:38.580 --> 01:25:42.740 and that kind of phrasing is going to be similar to this Harvard Business Review 01:25:42.740 --> 01:25:46.935 story that has a sentence like, when Facebook was founded in 2004-- 01:25:46.935 --> 01:25:49.310 and it's going to look across a number of other documents 01:25:49.310 --> 01:25:53.820 for similar types of patterns to be able to extract that kind of information. 01:25:53.820 --> 01:25:56.450 And what it will do is, if I go ahead and run, 01:25:56.450 --> 01:25:58.660 I'll go ahead and go into templates. 01:25:58.660 --> 01:26:01.220 So I'll say python search.py. 
01:26:01.220 --> 01:26:05.030 I'm going to look for data like the data in companies.csv 01:26:05.030 --> 01:26:08.690 inside of the companies directory, which contains a whole bunch of news articles 01:26:08.690 --> 01:26:10.900 that I've curated in advance. 01:26:10.900 --> 01:26:12.080 And here's what I get-- 01:26:12.080 --> 01:26:15.560 Google 1998, Apple 1976, Microsoft 1975-- 01:26:15.560 --> 01:26:16.400 so on and so forth-- 01:26:16.400 --> 01:26:18.470 Walmart 1962, for example. 01:26:18.470 --> 01:26:20.810 These are all of the pieces of data that happen 01:26:20.810 --> 01:26:23.750 to match that same template that we were able to find before. 01:26:23.750 --> 01:26:25.430 And how was it able to find this? 01:26:25.430 --> 01:26:29.460 Well, it's probably because, if we look at the Forbes article, 01:26:29.460 --> 01:26:34.730 for example, that it has a phrase in it like, when Walmart was founded in 1962, 01:26:34.730 --> 01:26:38.000 comma-- that it's able to identify these sorts of patterns 01:26:38.000 --> 01:26:39.890 and extract information from them. 01:26:39.890 --> 01:26:42.650 Now, granted, I have curated all these stories in advance 01:26:42.650 --> 01:26:46.130 in order to make sure that there is data that it's able to match on. 01:26:46.130 --> 01:26:49.100 And in practice, it's not always going to be in this exact format 01:26:49.100 --> 01:26:52.430 when you're seeing a company related to the year in which it was founded, 01:26:52.430 --> 01:26:56.030 but if you give the AI access to enough data-- like all of the data of text 01:26:56.030 --> 01:26:58.910 on the internet-- and just have the AI crawl the internet looking 01:26:58.910 --> 01:27:02.720 for information, it can very reliably, or with some probability, 01:27:02.720 --> 01:27:05.780 try and extract information using these sorts of templates 01:27:05.780 --> 01:27:08.330 and be able to generate interesting sorts of knowledge. 
01:27:08.330 --> 01:27:10.940 And the more knowledge it learns, the more new templates 01:27:10.940 --> 01:27:13.190 it's able to construct, looking for constructions that 01:27:13.190 --> 01:27:15.930 show up in other locations as well. 01:27:15.930 --> 01:27:17.910 So let's take a look at another example. 01:27:17.910 --> 01:27:20.955 And here I'll show you presidents.csv, 01:27:20.955 --> 01:27:23.330 where I have two presidents and their inauguration date-- 01:27:23.330 --> 01:27:28.220 so George Washington 1789, Barack Obama 2009, for example. 01:27:28.220 --> 01:27:31.430 And I also am going to give to our AI a corpus that 01:27:31.430 --> 01:27:34.550 just contains a single document, which is the Wikipedia 01:27:34.550 --> 01:27:37.880 article for the list of presidents of the United States, for example-- 01:27:37.880 --> 01:27:39.680 just information about presidents. 01:27:39.680 --> 01:27:45.147 And I'd like to extract from this raw HTML document on a web page information 01:27:45.147 --> 01:27:45.980 about the presidents. 01:27:45.980 --> 01:27:50.460 So I can say search in presidents.csv. 01:27:50.460 --> 01:27:53.720 And what I get is a whole bunch of data about presidents 01:27:53.720 --> 01:27:56.300 and what year they were likely inaugurated, by looking 01:27:56.300 --> 01:27:58.010 for patterns that matched-- 01:27:58.010 --> 01:28:00.180 Barack Obama 2009, for example-- 01:28:00.180 --> 01:28:02.280 looking for these sorts of patterns that happen 01:28:02.280 --> 01:28:07.287 to give us some clues as to what it is that a story happens to be about. 01:28:07.287 --> 01:28:08.370 So here's another example. 01:28:08.370 --> 01:28:12.710 If I open up inside of the olympics directory, here is a scraped version 01:28:12.710 --> 01:28:15.050 of the Olympic home page that has information 01:28:15.050 --> 01:28:16.610 about various different Olympics. 
01:28:16.610 --> 01:28:20.360 And maybe I want to extract Olympic locations and years 01:28:20.360 --> 01:28:21.980 from this particular page. 01:28:21.980 --> 01:28:24.950 Well, the way I can do that is using the exact same algorithm. 01:28:24.950 --> 01:28:29.730 I'm just saying, all right, here are two Olympics and where they were located-- 01:28:29.730 --> 01:28:32.160 so 2012 London, for example. 01:28:32.160 --> 01:28:35.030 Let me go ahead and just run this process, 01:28:35.030 --> 01:28:39.440 Python search, on olympics.csv, look at all the Olympic data set, 01:28:39.440 --> 01:28:41.280 and here I get some information back. 01:28:41.280 --> 01:28:43.310 Now, this information-- not totally perfect. 01:28:43.310 --> 01:28:45.530 There are a couple of examples that are obviously not 01:28:45.530 --> 01:28:48.955 quite right, because my template might have been a little bit too general. 01:28:48.955 --> 01:28:51.080 Maybe it was looking for a broad category of things 01:28:51.080 --> 01:28:55.190 and certain strange things happened to capture on that particular template. 01:28:55.190 --> 01:28:58.730 So you could imagine adding rules to try and make this process more intelligent, 01:28:58.730 --> 01:29:02.000 making sure the thing on the left is just a year, for example-- 01:29:02.000 --> 01:29:04.280 and doing other sorts of analysis. 01:29:04.280 --> 01:29:07.040 But purely just based on some data, we are 01:29:07.040 --> 01:29:10.700 able to extract some interesting information using some algorithms. 
01:29:10.700 --> 01:29:16.100 And all search.py is really doing here is it is taking my corpus of data, 01:29:16.100 --> 01:29:18.260 finding templates that match it-- 01:29:18.260 --> 01:29:22.280 here, I'm filtering down to just the top two templates that happen to match-- 01:29:22.280 --> 01:29:26.960 and then using those templates to extract results from the data 01:29:26.960 --> 01:29:30.860 that I have access to, being able to look for all of the information 01:29:30.860 --> 01:29:31.670 that I care about. 01:29:31.670 --> 01:29:33.587 And that's ultimately what's going to help me 01:29:33.587 --> 01:29:38.390 to print out those results and figure out what the matches happen to be. 01:29:38.390 --> 01:29:41.090 And so information extraction is another powerful tool 01:29:41.090 --> 01:29:43.970 when it comes to trying to extract information. 01:29:43.970 --> 01:29:46.220 But of course, it only works in very limited contexts. 01:29:46.220 --> 01:29:49.640 It only works when I'm able to find templates that look exactly 01:29:49.640 --> 01:29:53.000 like this in order to come up with some sort of match that 01:29:53.000 --> 01:29:55.430 is able to connect this to some pair of data, 01:29:55.430 --> 01:29:57.890 that this company was founded in this year. 01:29:57.890 --> 01:30:01.670 What I might want to do, as we start to think about the semantics of words, 01:30:01.670 --> 01:30:04.880 is to begin to imagine some way of coming up with definitions 01:30:04.880 --> 01:30:08.120 for all words, being able to relate all of the words in a dictionary 01:30:08.120 --> 01:30:12.110 to each other, because that's ultimately what's going to be necessary if we want 01:30:12.110 --> 01:30:13.530 our AI to be able to communicate. 01:30:13.530 --> 01:30:18.500 We need some representation of what it is that words mean. 01:30:18.500 --> 01:30:22.340 And one approach to doing this is a famous data set called WordNet. 
01:30:22.340 --> 01:30:24.440 And what WordNet is is a human-curated data set-- 01:30:24.440 --> 01:30:27.380 researchers have curated together a whole bunch of words, 01:30:27.380 --> 01:30:29.595 their definitions, their various different senses-- 01:30:29.595 --> 01:30:31.970 because a word might have multiple different meanings-- 01:30:31.970 --> 01:30:35.347 and also how those words relate to one another. 01:30:35.347 --> 01:30:36.680 And so what we mean by this is-- 01:30:36.680 --> 01:30:38.750 I can show you an example of WordNet. 01:30:38.750 --> 01:30:40.550 WordNet comes built into NLTK. 01:30:40.550 --> 01:30:44.060 Using NLTK, you can download and access WordNet. 01:30:44.060 --> 01:30:48.080 So let me go into WordNet, and go ahead and run WordNet, 01:30:48.080 --> 01:30:52.100 and extract information about a word-- a word like city, for example. 01:30:52.100 --> 01:30:53.600 Go ahead and press Return. 01:30:53.600 --> 01:30:56.210 And here is the information that I get back about a city. 01:30:56.210 --> 01:30:59.360 It turns out that city has three different senses, three 01:30:59.360 --> 01:31:01.460 different meanings, according to WordNet. 01:31:01.460 --> 01:31:03.770 And it's really just kind of like a dictionary, where 01:31:03.770 --> 01:31:07.400 each sense is associated with its meaning-- just some definition 01:31:07.400 --> 01:31:08.810 provided by a human. 01:31:08.810 --> 01:31:13.130 And then it's also got categories, for example, that a word belongs to-- 01:31:13.130 --> 01:31:15.830 that a city is a type of municipality, a city 01:31:15.830 --> 01:31:18.150 is a type of administrative district. 01:31:18.150 --> 01:31:20.510 And that allows me to relate words to other words. 01:31:20.510 --> 01:31:24.380 So one of the powers of WordNet is the ability to take one word 01:31:24.380 --> 01:31:28.590 and connect it to other related words. 01:31:28.590 --> 01:31:33.380 If I do another example, let me try the word house, for instance. 
01:31:33.380 --> 01:31:36.690 I'll type in the word house and see what I get back. 01:31:36.690 --> 01:31:38.750 Well, all right, a house is a kind of building. 01:31:38.750 --> 01:31:42.160 A house is somehow related to a family unit. 01:31:42.160 --> 01:31:43.910 And so you might imagine trying to come up 01:31:43.910 --> 01:31:46.760 with these various different ways of describing a house. 01:31:46.760 --> 01:31:47.490 It is a building. 01:31:47.490 --> 01:31:48.500 It is a dwelling. 01:31:48.500 --> 01:31:51.110 And researchers have just curated these relationships 01:31:51.110 --> 01:31:55.100 between these various different words to say that a house is a type of building, 01:31:55.100 --> 01:31:58.890 that a house is a type of dwelling, for example. 01:31:58.890 --> 01:32:01.370 But this type of approach, while certainly 01:32:01.370 --> 01:32:04.640 helpful for being able to relate words to one another, 01:32:04.640 --> 01:32:06.920 doesn't scale particularly well. 01:32:06.920 --> 01:32:08.990 As you start to think about language changing, 01:32:08.990 --> 01:32:11.870 as you start to think about all the various different relationships 01:32:11.870 --> 01:32:16.070 that words might have to one another, this challenge of word representation 01:32:16.070 --> 01:32:18.200 ends up being difficult. What we've done is just 01:32:18.200 --> 01:32:23.450 defined a word as a sentence that explains what it is that that word is, 01:32:23.450 --> 01:32:26.030 but what we really would like is some way 01:32:26.030 --> 01:32:28.615 to represent the meaning of a word in a way 01:32:28.615 --> 01:32:31.240 that our AI is going to be able to do something useful with it. 
01:32:31.240 --> 01:32:33.830 Anytime we want our AI to be able to look at texts 01:32:33.830 --> 01:32:35.840 and really understand what that text means, 01:32:35.840 --> 01:32:38.360 to relate text and words to similar words 01:32:38.360 --> 01:32:40.700 and understand the relationship between words, 01:32:40.700 --> 01:32:44.745 we'd like some way that a computer can represent this information. 01:32:44.745 --> 01:32:46.620 And what we've seen all throughout the course 01:32:46.620 --> 01:32:48.800 multiple times now is the idea that, when 01:32:48.800 --> 01:32:51.110 we want our AI to represent something, it 01:32:51.110 --> 01:32:54.890 can be helpful to have the AI represent it using numbers-- 01:32:54.890 --> 01:32:57.530 that we've seen that we can represent utilities in a game, 01:32:57.530 --> 01:32:59.900 like winning, or losing, or drawing, as a number-- 01:32:59.900 --> 01:33:01.520 1, negative 1, or 0. 01:33:01.520 --> 01:33:04.400 We've seen other ways that we can take data and turn it 01:33:04.400 --> 01:33:06.650 into a vector of features, where we just have 01:33:06.650 --> 01:33:11.270 a whole bunch of numbers that represent some particular piece of data. 01:33:11.270 --> 01:33:14.340 And if we ever want to pass words into a neural network, 01:33:14.340 --> 01:33:16.580 for instance, to be able to say, given some word, 01:33:16.580 --> 01:33:18.650 translate this sentence into another sentence, 01:33:18.650 --> 01:33:21.890 or to be able to do interesting classifications with neural networks 01:33:21.890 --> 01:33:26.000 on individual words, we need some representation of words 01:33:26.000 --> 01:33:27.980 just in terms of vectors-- 01:33:27.980 --> 01:33:31.820 some way to represent words just by using individual numbers 01:33:31.820 --> 01:33:34.495 to define the meaning of a word. 01:33:34.495 --> 01:33:35.370 So how do we do that? 
01:33:35.370 --> 01:33:37.767 How do we take words and turn them into vectors 01:33:37.767 --> 01:33:40.100 that we can use to represent the meaning of those words? 01:33:40.100 --> 01:33:42.110 Well, one way is to do this. 01:33:42.110 --> 01:33:46.280 If I have four words that I want to encode, like he wrote a book, 01:33:46.280 --> 01:33:49.250 I can just say, let's let the word he be this vector-- 01:33:49.250 --> 01:33:51.470 1, 0, 0, 0. 01:33:51.470 --> 01:33:53.990 Wrote will be 0, 1, 0, 0. 01:33:53.990 --> 01:33:56.390 A will be 0, 0, 1, 0. 01:33:56.390 --> 01:33:59.570 Book will be 0, 0, 0, 1. 01:33:59.570 --> 01:34:03.410 Effectively, what I have here is what's known as a one-hot representation 01:34:03.410 --> 01:34:06.930 or a one-hot encoding, which is a representation of meaning, 01:34:06.930 --> 01:34:10.580 where meaning is a vector that has a single 1 in it and the rest are 0's. 01:34:10.580 --> 01:34:14.540 The location of the 1 tells me the meaning of the word-- 01:34:14.540 --> 01:34:17.020 that a 1 in the first position, that means he-- 01:34:17.020 --> 01:34:19.510 a 1 in the second position, that means wrote. 01:34:19.510 --> 01:34:21.740 And every word in the dictionary is going 01:34:21.740 --> 01:34:24.770 to be assigned to some representation like this, where we just 01:34:24.770 --> 01:34:28.320 assign one place in the vector that has a 1 for the word 01:34:28.320 --> 01:34:29.450 and 0 for the other words. 01:34:29.450 --> 01:34:31.580 And now I have representations of words that 01:34:31.580 --> 01:34:33.710 are different for a whole bunch of different words. 01:34:33.710 --> 01:34:36.853 This is this one-hot representation. 01:34:36.853 --> 01:34:38.270 So what are the drawbacks of this? 01:34:38.270 --> 01:34:40.970 Why is this not necessarily a great approach? 01:34:40.970 --> 01:34:42.980 Well, here, I am only creating enough vectors 01:34:42.980 --> 01:34:45.530 to represent four words in a dictionary. 
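The one-hot scheme just described is simple enough to build in a few lines (a minimal sketch over the four-word example from the lecture):

```python
def one_hot_encode(vocabulary):
    """Map each word to a vector with a single 1 at that word's own position."""
    return {
        word: [1 if i == j else 0 for j in range(len(vocabulary))]
        for i, word in enumerate(vocabulary)
    }

vectors = one_hot_encode(["he", "wrote", "a", "book"])
print(vectors["he"])     # [1, 0, 0, 0]
print(vectors["wrote"])  # [0, 1, 0, 0]
print(vectors["book"])   # [0, 0, 0, 1]
```

Note that each vector's length equals the vocabulary size, which is exactly the scaling problem raised next.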
01:34:45.530 --> 01:34:49.580 If you imagine a dictionary with 50,000 words that I might want to represent, 01:34:49.580 --> 01:34:51.590 now these vectors get enormously long. 01:34:51.590 --> 01:34:54.800 These are 50,000 dimensional vectors to represent 01:34:54.800 --> 01:34:58.940 a vocabulary of 50,000 words-- that he is a 1 followed by all these 0's. 01:34:58.940 --> 01:35:01.280 Wrote has a whole bunch of 0's in it. 01:35:01.280 --> 01:35:05.070 That's not a particularly tractable way of trying to represent words, 01:35:05.070 --> 01:35:09.860 if I'm going to have to deal with vectors of length 50,000. 01:35:09.860 --> 01:35:12.140 Another problem-- a subtler problem-- 01:35:12.140 --> 01:35:14.870 is that ideally, I'd like for these vectors 01:35:14.870 --> 01:35:17.960 to somehow represent meaning in a way that I can extract 01:35:17.960 --> 01:35:21.740 useful information out of-- that if I have the sentence he wrote a book 01:35:21.740 --> 01:35:26.270 and he authored a novel, well, wrote and authored are going to be two 01:35:26.270 --> 01:35:28.040 totally different vectors. 01:35:28.040 --> 01:35:32.180 And book and novel are going to be two totally different vectors inside 01:35:32.180 --> 01:35:35.030 of my vector space that have nothing to do with each other. 01:35:35.030 --> 01:35:38.420 The 1 is just located in a different position. 01:35:38.420 --> 01:35:40.790 And really, what I would like to have happen 01:35:40.790 --> 01:35:43.600 is for wrote and authored to have vectors 01:35:43.600 --> 01:35:47.020 that are similar to one another, and for book and novel 01:35:47.020 --> 01:35:49.900 to have vector representations that are similar to one another, 01:35:49.900 --> 01:35:52.780 because they are words that have similar meanings. 
01:35:52.780 --> 01:35:56.320 Because their meanings are similar, ideally, I'd like for-- 01:35:56.320 --> 01:35:59.860 when I put them in vector form and use a vector to represent meanings, 01:35:59.860 --> 01:36:04.400 I would like for those vectors to be similar to one another as well. 01:36:04.400 --> 01:36:06.640 So rather than this one-hot representation, 01:36:06.640 --> 01:36:10.000 where we represent a word's meaning by just giving it a vector that is one 01:36:10.000 --> 01:36:12.620 in a particular location, what we're going to do-- 01:36:12.620 --> 01:36:15.400 which is a bit of a strange thing the first time you see it-- 01:36:15.400 --> 01:36:18.640 is what we're going to call a distributed representation. 01:36:18.640 --> 01:36:21.580 We are going to represent the meaning of a word as just 01:36:21.580 --> 01:36:25.330 a whole bunch of different values-- not just a single 1 and the rest 0's, 01:36:25.330 --> 01:36:26.630 but a whole bunch of values. 01:36:26.630 --> 01:36:31.240 So for example, in he wrote a book, he might just be a big vector. 01:36:31.240 --> 01:36:34.510 Maybe it's 50 dimensions, maybe it's 100 dimensions, but certainly fewer 01:36:34.510 --> 01:36:39.430 than like tens of thousands, where each value is just some number-- 01:36:39.430 --> 01:36:42.160 and same thing for wrote, and a, and book. 01:36:42.160 --> 01:36:45.070 And the idea now is that, using these vector representations, 01:36:45.070 --> 01:36:48.850 I'd hope that wrote and authored have vector representations that 01:36:48.850 --> 01:36:50.317 are pretty close to one another. 01:36:50.317 --> 01:36:52.900 Their distance is not too far apart-- and same with the vector 01:36:52.900 --> 01:36:56.230 representations for book and novel. 
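One common way to measure how "close" two distributed word vectors are is cosine similarity. Here is a minimal sketch; the three-dimensional vectors below are invented purely for illustration (real distributed representations have 50 to 100 or more dimensions and are learned from data):

```python
import math

def cosine_similarity(u, v):
    """Similarity of two vectors: near 1.0 for similar directions, near 0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy distributed representations (values made up for illustration only)
book = [0.9, 0.1, 0.8]
novel = [0.8, 0.2, 0.9]
breakfast = [0.1, 0.9, 0.0]

print(cosine_similarity(book, novel))      # close to 1: similar meanings
print(cosine_similarity(book, breakfast))  # much smaller: unrelated meanings
```

With one-hot vectors, any two distinct words would have cosine similarity exactly 0, which is precisely why distributed representations are needed.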
01:36:56.230 --> 01:37:00.940 So this is going to be the goal of a lot of what statistical machine learning 01:37:00.940 --> 01:37:02.710 approaches to natural language processing 01:37:02.710 --> 01:37:06.760 are about: using these vector representations of words. 01:37:06.760 --> 01:37:10.190 But how on earth do we define a word as just a whole bunch 01:37:10.190 --> 01:37:11.440 of numbers in a sequence? 01:37:11.440 --> 01:37:16.668 What does it even mean to talk about the meaning of a word? 01:37:16.668 --> 01:37:18.460 The famous quote that answers this question 01:37:18.460 --> 01:37:22.930 is from a British linguist in the 1950s, J.R. Firth, who said, "You shall 01:37:22.930 --> 01:37:25.060 know a word by the company it keeps." 01:37:28.150 --> 01:37:30.400 And what we mean by that is the idea that we 01:37:30.400 --> 01:37:35.290 can define a word in terms of the words that show up around it, that we can get 01:37:35.290 --> 01:37:39.070 at the meaning of a word based on the context in which that word happens 01:37:39.070 --> 01:37:40.370 to appear. 01:37:40.370 --> 01:37:43.900 That if I have a sentence like this, four words in sequence-- 01:37:43.900 --> 01:37:46.180 for blank he ate-- 01:37:46.180 --> 01:37:47.442 what goes in the blank? 01:37:47.442 --> 01:37:49.150 Well, you might imagine that, in English, 01:37:49.150 --> 01:37:52.192 the types of words that might fill in the blank are words like breakfast, 01:37:52.192 --> 01:37:53.170 or lunch, or dinner. 01:37:53.170 --> 01:37:56.480 These are the kinds of words that fill in that blank. 
01:37:56.480 --> 01:38:00.730 And so if we want to define what lunch or dinner means, 01:38:00.730 --> 01:38:03.970 we can define it in terms of what words happen 01:38:03.970 --> 01:38:07.030 to show up around it-- that if a word shows up 01:38:07.030 --> 01:38:09.700 in a particular context and another word happens to show up 01:38:09.700 --> 01:38:13.750 in very similar contexts, then those two words are probably 01:38:13.750 --> 01:38:15.040 related to each other. 01:38:15.040 --> 01:38:18.280 They probably have a similar meaning to one another. 01:38:18.280 --> 01:38:20.950 And this then is the foundational idea of an algorithm 01:38:20.950 --> 01:38:24.760 known as word2vec, which is a model for generating word vectors. 01:38:24.760 --> 01:38:28.960 You give word2vec a corpus of documents, just a whole bunch of text, 01:38:28.960 --> 01:38:34.832 and what word2vec will produce is vectors for each word. 01:38:34.832 --> 01:38:36.790 And there are a number of ways that it can do this. 01:38:36.790 --> 01:38:40.300 One common way is through what's known as the skip-gram architecture, which 01:38:40.300 --> 01:38:44.470 basically uses a neural network to predict context words, 01:38:44.470 --> 01:38:47.240 given a target word-- so given a word like lunch, 01:38:47.240 --> 01:38:50.350 use a neural network to try and predict, given the word lunch, what 01:38:50.350 --> 01:38:53.190 words are going to show up around it. 01:38:53.190 --> 01:38:55.210 And so the way we might represent this is 01:38:55.210 --> 01:38:57.760 with a big neural network like this, where 01:38:57.760 --> 01:39:00.820 we have one input cell for every word. 01:39:00.820 --> 01:39:04.900 Every word gets one node inside this neural network. 01:39:04.900 --> 01:39:07.780 And the goal is to use this neural network to predict, 01:39:07.780 --> 01:39:09.790 given a target word, a context word. 
01:39:09.790 --> 01:39:14.030 Given a word like lunch, can I predict the probabilities of other words 01:39:14.030 --> 01:39:18.560 showing up in a context of one word away or two words away, for instance, 01:39:18.560 --> 01:39:21.970 in some sort of window of context? 01:39:21.970 --> 01:39:27.400 And if you just give the AI, this neural network, a whole bunch of data of words 01:39:27.400 --> 01:39:30.790 and what words show up in context, you can train a neural network 01:39:30.790 --> 01:39:34.600 to do this calculation, to be able to predict, given a target word-- 01:39:34.600 --> 01:39:39.103 can I predict what those context words ultimately should be? 01:39:39.103 --> 01:39:41.020 And it will do so using the same methods we've 01:39:41.020 --> 01:39:43.850 talked about-- back-propagating the error from the context word 01:39:43.850 --> 01:39:46.090 back through this neural network. 01:39:46.090 --> 01:39:48.790 And what you get is, if we use a single layer-- 01:39:48.790 --> 01:39:50.950 just a single layer of hidden nodes-- 01:39:50.950 --> 01:39:54.960 what I get is, for every single one of these words, I get-- 01:39:54.960 --> 01:39:59.680 from this word, for example, I get five edges, each of which 01:39:59.680 --> 01:40:02.695 has a weight to each of these five hidden nodes. 01:40:02.695 --> 01:40:05.950 In other words, I get five numbers that effectively 01:40:05.950 --> 01:40:10.180 are going to represent this particular target word here. 01:40:10.180 --> 01:40:13.750 And the number of hidden nodes I choose in this middle layer here-- 01:40:13.750 --> 01:40:14.420 I can pick that. 01:40:14.420 --> 01:40:17.830 Maybe I'll choose to have 50 hidden nodes or 100 hidden nodes. 
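The skip-gram setup described here, predicting the words within some window around a target word, starts from (target, context) training pairs extracted from a corpus. A minimal sketch of that extraction step (the sentence and window size are illustrative assumptions, not the lecture's code):

```python
# Build skip-gram training pairs: for each target word, every word
# within `window` positions of it becomes a context word to predict.

def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs for every word in the token list."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "for lunch he ate".split()
print(skipgram_pairs(sentence, window=1))
# [('for', 'lunch'), ('lunch', 'for'), ('lunch', 'he'),
#  ('he', 'lunch'), ('he', 'ate'), ('ate', 'he')]
```

The neural network is then trained on these pairs, adjusting its weights so that each target word's input edges come to predict its observed context words.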
01:40:17.830 --> 01:40:19.720 And then, for each of these target words, 01:40:19.720 --> 01:40:22.630 I'll have 50 different values or 100 different values, 01:40:22.630 --> 01:40:26.050 and those values we can effectively treat as the vector 01:40:26.050 --> 01:40:29.320 numerical representation of that word. 01:40:29.320 --> 01:40:33.520 And the general idea here is that, if words are similar-- 01:40:33.520 --> 01:40:37.660 two words show up in similar contexts, meaning, using those same target words, 01:40:37.660 --> 01:40:40.380 I'd like to predict similar context words-- 01:40:40.380 --> 01:40:43.180 well, then these vectors and these values I choose in these vectors 01:40:43.180 --> 01:40:45.940 here-- these numerical values for the weights of these edges-- 01:40:45.940 --> 01:40:49.180 are probably going to be similar, because for two different words that 01:40:49.180 --> 01:40:51.580 show up in similar contexts, I would like 01:40:51.580 --> 01:40:55.030 for these values that are calculated to ultimately 01:40:55.030 --> 01:40:58.250 be very similar to one another. 01:40:58.250 --> 01:41:01.030 And so ultimately, the high-level way you can picture this 01:41:01.030 --> 01:41:02.980 is that what this word2vec training method is 01:41:02.980 --> 01:41:06.790 going to do is, given a whole bunch of words, where initially, 01:41:06.790 --> 01:41:09.430 recall, we initialize these weights randomly, just picking 01:41:09.430 --> 01:41:11.650 random weights to start. 01:41:11.650 --> 01:41:14.050 Over time, as we train the neural network, 01:41:14.050 --> 01:41:17.680 we're going to adjust these weights, adjust the vector representations 01:41:17.680 --> 01:41:20.860 of each of these words so that gradually, 01:41:20.860 --> 01:41:24.970 words that show up in similar contexts grow closer to one another, 01:41:24.970 --> 01:41:27.190 and words that show up in different contexts 01:41:27.190 --> 01:41:29.210 get farther away from one another. 
01:41:29.210 --> 01:41:32.890 And as a result, hopefully I get vector representations 01:41:32.890 --> 01:41:36.760 of words like breakfast, and lunch, and dinner that are similar to one another, 01:41:36.760 --> 01:41:39.100 and then words like book, and memoir, and novel 01:41:39.100 --> 01:41:42.830 are also going to be similar to one another as well. 01:41:42.830 --> 01:41:46.510 So using this algorithm, we're able to take a corpus of data 01:41:46.510 --> 01:41:50.230 and just train our computer, train this neural network, to be able to figure out 01:41:50.230 --> 01:41:52.650 what vector, what sequence of numbers, is going 01:41:52.650 --> 01:41:55.900 to represent each of these words-- which is, again, a bit of a strange concept 01:41:55.900 --> 01:41:59.450 to think about, representing a word just as a whole bunch of numbers. 01:41:59.450 --> 01:42:02.860 But we'll see in a moment just how powerful this really can be. 01:42:02.860 --> 01:42:08.290 So we'll go ahead and go into vectors, and what I have inside of vectors.py-- 01:42:08.290 --> 01:42:09.910 which I'll open up now-- 01:42:09.910 --> 01:42:14.800 is I'm opening up words.txt, which is a pretrained model that just-- 01:42:14.800 --> 01:42:17.230 I've already run word2vec and it's already given me 01:42:17.230 --> 01:42:19.810 a whole bunch of vectors for each of these possible words. 01:42:19.810 --> 01:42:22.330 And I'm just going to take like 50,000 of them 01:42:22.330 --> 01:42:26.420 and go ahead and save their vectors inside of a dictionary called words. 01:42:26.420 --> 01:42:29.260 And then I've also defined some functions called distance; 01:42:29.260 --> 01:42:33.820 closest_words, which will get me the closest words to a particular word; 01:42:33.820 --> 01:42:38.390 and then closest_word, which just gets me the one closest word, for example. 01:42:38.390 --> 01:42:39.860 And so now let me try doing this. 
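Loading pretrained vectors into a dictionary called words, as vectors.py is described as doing, might look roughly like the sketch below. The file format assumed here (each line holding a word followed by the numbers of its vector) is a guess at what words.txt contains, and the demonstration writes a tiny stand-in file rather than using the real one:

```python
# Sketch: read up to `limit` pretrained word vectors into a dictionary.
# Assumed format: one word per line, the word first, then its numbers.
import os
import tempfile

def load_vectors(path, limit=50000):
    words = {}
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= limit:
                break
            parts = line.split()
            words[parts[0]] = [float(x) for x in parts[1:]]
    return words

# Tiny demonstration file standing in for the real words.txt:
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("city 0.1 -0.4 0.7\nhouse 0.2 -0.3 0.6\n")
    path = f.name

words = load_vectors(path)
print(words["city"])  # [0.1, -0.4, 0.7]
os.unlink(path)
```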
01:42:39.860 --> 01:42:43.180 Let me open up the Python interpreter and say something like, 01:42:43.180 --> 01:42:46.080 from vectors import star-- 01:42:46.080 --> 01:42:48.590 just import everything from vectors. 01:42:48.590 --> 01:42:51.700 And now let's take a look at the meanings of some words. 01:42:51.700 --> 01:42:55.760 Let me look at the word city, for example. 01:42:55.760 --> 01:43:01.130 And here is a big array that is the vector representation of the word 01:43:01.130 --> 01:43:01.630 city. 01:43:01.630 --> 01:43:04.755 And this doesn't mean anything, in terms of what these numbers exactly are, 01:43:04.755 --> 01:43:07.390 but this is how my computer is representing 01:43:07.390 --> 01:43:08.990 the meaning of the word city. 01:43:08.990 --> 01:43:11.200 We can do a different word, like the word house, 01:43:11.200 --> 01:43:14.860 and here then is the vector representation of the word house, 01:43:14.860 --> 01:43:17.140 for example-- just a whole bunch of numbers. 01:43:17.140 --> 01:43:20.650 And this is encoding somehow the meaning of the word house. 01:43:20.650 --> 01:43:22.390 And how do I get at that idea? 01:43:22.390 --> 01:43:24.880 Well, one way to measure how good this is is by looking at, 01:43:24.880 --> 01:43:29.282 what is the distance between various different words? 01:43:29.282 --> 01:43:31.240 There are a number of ways you can define distance. 01:43:31.240 --> 01:43:33.310 In the context of vectors, one common way is what's 01:43:33.310 --> 01:43:35.860 known as the cosine distance, which has to do with measuring 01:43:35.860 --> 01:43:37.580 the angle between vectors. 01:43:37.580 --> 01:43:40.150 But in short, it's just measuring, how far apart 01:43:40.150 --> 01:43:42.710 are these two vectors from each other? 01:43:42.710 --> 01:43:47.210 So if I take a word like the word book, how far away is it from itself-- 01:43:47.210 --> 01:43:49.540 how far away is the word book from book-- 01:43:49.540 --> 01:43:50.440 well, that's zero. 
01:43:50.440 --> 01:43:54.400 The word book is zero distance away from itself. 01:43:54.400 --> 01:43:59.180 But let's see how far away the word book is from a word like breakfast, 01:43:59.180 --> 01:44:03.790 where we're going to say one is very far away, zero is not far away. 01:44:03.790 --> 01:44:07.430 All right, book is about 0.64 away from breakfast. 01:44:07.430 --> 01:44:09.560 They seem to be pretty far apart. 01:44:09.560 --> 01:44:12.920 But let's now try and calculate the distance from the word book 01:44:12.920 --> 01:44:16.842 to the word novel, for example. 01:44:16.842 --> 01:44:18.800 Now, those two words are closer to each other-- 01:44:18.800 --> 01:44:19.730 0.34. 01:44:19.730 --> 01:44:21.950 The vector representation of the word book 01:44:21.950 --> 01:44:25.190 is closer to the vector representation of the word novel 01:44:25.190 --> 01:44:28.350 than it is to the vector representation of the word breakfast. 01:44:28.350 --> 01:44:34.010 And I can do the same thing and, say, compare breakfast to lunch, 01:44:34.010 --> 01:44:35.765 for example. 01:44:35.765 --> 01:44:37.640 And those two words are even closer together. 01:44:37.640 --> 01:44:40.010 They have an even more similar relationship 01:44:40.010 --> 01:44:42.470 between one word and another. 01:44:42.470 --> 01:44:45.500 So now it seems we have some representation of words, 01:44:45.500 --> 01:44:49.610 representing a word using vectors, that allows us to be able to say something 01:44:49.610 --> 01:44:52.340 like words that are similar to each other 01:44:52.340 --> 01:44:55.940 ultimately have a smaller distance between them. 01:44:55.940 --> 01:44:58.070 And this turns out to be incredibly powerful to be 01:44:58.070 --> 01:45:01.760 able to represent the meaning of words in terms of their relationships 01:45:01.760 --> 01:45:03.620 to other words as well. 
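One way to picture the distance function used here is the following sketch of cosine distance, where identical vectors are 0 apart and unrelated ones approach 1. The three-dimensional example vectors are invented for illustration; real word vectors are much longer, and this may differ in detail from the lecture's distance function:

```python
# Cosine distance: 1 minus the cosine of the angle between two vectors.
from math import sqrt

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

# Toy vectors: book and novel point in similar directions,
# breakfast in a rather different one.
book = [0.5, 0.8, -0.1]
novel = [0.4, 0.9, -0.2]
breakfast = [-0.7, 0.1, 0.6]

print(cosine_distance(book, book))  # approximately 0.0
print(cosine_distance(book, novel) < cosine_distance(book, breakfast))  # True
```

Because only the angle matters, two vectors of very different lengths can still have distance near zero if they point the same way.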
01:45:03.620 --> 01:45:05.000 I can tell you as well-- 01:45:05.000 --> 01:45:06.980 I have a function called closest_words that 01:45:06.980 --> 01:45:09.320 basically just takes a word 01:45:09.320 --> 01:45:11.520 and gets all the closest words to it. 01:45:11.520 --> 01:45:15.980 So let me get the closest words to book, for example, 01:45:15.980 --> 01:45:18.500 and maybe get the 10 closest words. 01:45:18.500 --> 01:45:20.950 We'll limit ourselves to 10. 01:45:20.950 --> 01:45:21.450 And, right, 01:45:21.450 --> 01:45:24.420 book is obviously closest to itself-- the word book-- 01:45:24.420 --> 01:45:27.630 but it's also closely related to books, and essay, and memoir, and essays, 01:45:27.630 --> 01:45:29.450 and novella, and anthology. 01:45:29.450 --> 01:45:32.370 And why are these the words it computed as being close to it? 01:45:32.370 --> 01:45:34.710 Well, because based on the corpus of information 01:45:34.710 --> 01:45:38.220 that this algorithm was trained on, the vectors 01:45:38.220 --> 01:45:41.270 arose based on what words show up in a similar context-- 01:45:41.270 --> 01:45:45.420 that the word book shows up in similar contexts to words 01:45:45.420 --> 01:45:47.730 like memoir and essays, for example. 01:45:47.730 --> 01:45:49.110 And if I do something like-- 01:45:49.110 --> 01:45:53.740 let me get the closest words to city-- 01:45:53.740 --> 01:45:56.800 you end up getting city, town, township, village. 01:45:56.800 --> 01:46:02.200 These are words that happen to show up in a similar context to the word city. 01:46:02.200 --> 01:46:05.787 Now, where things get really interesting is that, because these are vectors, 01:46:05.787 --> 01:46:07.120 we can do mathematics with them. 01:46:07.120 --> 01:46:11.210 We can calculate the relationships between various different words. 01:46:11.210 --> 01:46:16.240 So I can say something like, all right, what if I had man and king? 
01:46:16.240 --> 01:46:18.790 These are two different vectors, and this is a famous example 01:46:18.790 --> 01:46:20.950 that comes out of word2vec. 01:46:20.950 --> 01:46:24.920 I can take these two vectors and just subtract them from each other. 01:46:24.920 --> 01:46:28.040 This line here, the distance here, is another vector 01:46:28.040 --> 01:46:30.430 that represents king minus man. 01:46:30.430 --> 01:46:33.123 Now, what does it mean to take a word and subtract another word? 01:46:33.123 --> 01:46:34.540 Normally, that doesn't make sense. 01:46:34.540 --> 01:46:37.082 In the world of vectors, though, you can take some vector, some 01:46:37.082 --> 01:46:40.090 sequence of numbers, subtract some other sequence of numbers, 01:46:40.090 --> 01:46:43.240 and get a new vector, get a new sequence of numbers. 01:46:43.240 --> 01:46:46.690 And what this new sequence of numbers is effectively going to do 01:46:46.690 --> 01:46:52.000 is it is going to tell me, what do I need to do to get from man to king? 01:46:52.000 --> 01:46:54.640 What is the relationship then between these two words? 01:46:54.640 --> 01:46:58.120 And this is some vector representation of what 01:46:58.120 --> 01:47:00.640 takes us from man to king. 01:47:00.640 --> 01:47:04.730 And we can then take this value and add it to another vector. 01:47:04.730 --> 01:47:07.700 You might imagine that the word woman, for example, 01:47:07.700 --> 01:47:10.330 is another vector that exists somewhere inside of this space, 01:47:10.330 --> 01:47:12.430 somewhere inside of this vector space. 01:47:12.430 --> 01:47:15.550 And what might happen if I took this same idea, king 01:47:15.550 --> 01:47:19.930 minus man-- took that same vector and just added it to woman? 01:47:19.930 --> 01:47:22.480 What will we find around here? 
01:47:22.480 --> 01:47:24.230 It's an interesting question we might ask, 01:47:24.230 --> 01:47:27.700 and we can answer it very easily, because I have vector representations 01:47:27.700 --> 01:47:30.500 of all of these things. 01:47:30.500 --> 01:47:31.660 Let's go back here. 01:47:31.660 --> 01:47:34.690 Let me look at the representation of the word man. 01:47:34.690 --> 01:47:36.887 Here's the vector representation of man. 01:47:36.887 --> 01:47:38.970 Let's look at the representation of the word king. 01:47:38.970 --> 01:47:41.222 Here's the representation of the word king. 01:47:41.222 --> 01:47:42.430 And I can subtract these two. 01:47:42.430 --> 01:47:46.260 What is the vector representation of king minus man? 01:47:46.260 --> 01:47:48.250 It's this array right here-- 01:47:48.250 --> 01:47:49.600 whole bunch of values. 01:47:49.600 --> 01:47:53.620 So king minus man now represents the relationship between king and man 01:47:53.620 --> 01:47:55.940 in some sort of numerical vector format. 01:47:55.940 --> 01:48:00.170 So what happens then if I add woman to that? 01:48:00.170 --> 01:48:04.640 Whatever took us from man to king, go ahead and apply that same vector 01:48:04.640 --> 01:48:07.520 to the vector representation of the word woman, 01:48:07.520 --> 01:48:10.960 and that gives us this vector here. 01:48:10.960 --> 01:48:15.130 And now, just out of curiosity, let's take this expression 01:48:15.130 --> 01:48:20.720 and find, what is the closest word to that expression? 01:48:20.720 --> 01:48:25.130 And amazingly, what we get is we get the word queen-- 01:48:25.130 --> 01:48:28.820 that somehow, when you take the distance between man and king-- 01:48:28.820 --> 01:48:32.090 this numerical representation of how man is related to king-- 01:48:32.090 --> 01:48:34.780 and add that same notion, king minus man, 01:48:34.780 --> 01:48:37.100 to the vector representation of the word woman. 
01:48:37.100 --> 01:48:40.790 What we get is we get the vector representation, or something close 01:48:40.790 --> 01:48:43.490 to the vector representation, of the word queen, 01:48:43.490 --> 01:48:48.130 because this distance somehow encoded the relationship between these two 01:48:48.130 --> 01:48:48.630 words. 01:48:48.630 --> 01:48:50.422 And when you run it through this algorithm, 01:48:50.422 --> 01:48:53.240 it's not programmed to do this, but if you just try and figure 01:48:53.240 --> 01:48:55.700 out how to predict words based on context words, 01:48:55.700 --> 01:48:59.960 you get vectors that are able to make these SAT-like analogies out 01:48:59.960 --> 01:49:02.232 of the information that has been given. 01:49:02.232 --> 01:49:03.690 So there are more examples of this. 01:49:03.690 --> 01:49:06.230 We can say, all right, let's figure out, what 01:49:06.230 --> 01:49:10.790 is the distance between Paris and France? 01:49:10.790 --> 01:49:12.580 So Paris and France are words. 01:49:12.580 --> 01:49:14.390 They each have a vector representation. 01:49:14.390 --> 01:49:18.680 This then is a vector representation of the distance between Paris and France-- 01:49:18.680 --> 01:49:21.530 what takes us from France to Paris. 01:49:21.530 --> 01:49:26.540 And let me go ahead and add the vector representation of England to that. 01:49:26.540 --> 01:49:29.690 So this then is the vector representation 01:49:29.690 --> 01:49:35.470 of going Paris minus France plus England-- 01:49:35.470 --> 01:49:38.130 so the distance between France and Paris as vectors. 01:49:38.130 --> 01:49:40.860 Add the England vector, and let's go ahead 01:49:40.860 --> 01:49:43.860 and find the closest word to that. 01:49:47.080 --> 01:49:48.550 And it turns out to be London. 01:49:48.550 --> 01:49:51.610 You do this relationship, the relationship between France and Paris. 
01:49:51.610 --> 01:49:55.000 Go ahead and add the England vector to it, and the closest vector to that 01:49:55.000 --> 01:49:57.120 happens to be the vector for the word London. 01:49:57.120 --> 01:49:58.120 We can do more examples. 01:49:58.120 --> 01:50:00.700 I can say, let's take the word for teacher-- 01:50:00.700 --> 01:50:03.700 that vector representation-- and let me subtract 01:50:03.700 --> 01:50:05.470 the vector representation of school. 01:50:05.470 --> 01:50:09.310 So what I'm left with is, what takes us from school to teacher? 01:50:09.310 --> 01:50:14.050 And apply that vector to a word like hospital and see, 01:50:14.050 --> 01:50:15.670 what is the closest word to that-- 01:50:15.670 --> 01:50:17.680 turns out the closest word is nurse. 01:50:17.680 --> 01:50:23.400 Let's try a couple more examples-- the word ramen, for example. 01:50:23.400 --> 01:50:25.610 Subtract the word Japan. 01:50:25.610 --> 01:50:28.150 So what is the relationship between Japan and ramen? 01:50:28.150 --> 01:50:30.310 Add the word for America to that. 01:50:30.310 --> 01:50:33.340 Want to take a guess as to what you might get as a result? 01:50:33.340 --> 01:50:35.840 Turns out you get burritos as the result. 01:50:35.840 --> 01:50:38.050 If you do the subtraction, do the addition, 01:50:38.050 --> 01:50:42.080 this is the answer that you happen to get as a consequence of this as well. 01:50:42.080 --> 01:50:44.703 So these very interesting analogies arise 01:50:44.703 --> 01:50:46.620 in the relationships between these words-- 01:50:46.620 --> 01:50:50.420 that if you just map out all of these words into a vector space, 01:50:50.420 --> 01:50:54.380 you can get some pretty interesting results as a consequence of that. 
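The king minus man plus woman arithmetic walked through above can be sketched with toy vectors. These 3-dimensional embeddings are invented for illustration (real word2vec vectors have far more dimensions and would not line up this neatly), and closest_word here is a brute-force stand-in for the lecture's function of the same name:

```python
# Word-analogy arithmetic with invented toy embeddings.
from math import sqrt

embeddings = {
    "man":   [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king":  [2.0, 0.0, 0.9],
    "queen": [2.0, 1.0, 0.9],
    "apple": [-1.0, 0.3, -0.5],
}

def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def closest_word(vector):
    """Return the vocabulary word whose vector is nearest to `vector`."""
    return min(embeddings, key=lambda w: euclidean(embeddings[w], vector))

# king - man + woman: apply the man-to-king direction to woman.
target = [k - m + w for k, m, w in
          zip(embeddings["king"], embeddings["man"], embeddings["woman"])]
print(closest_word(target))  # queen
```

With real trained vectors the result is only approximately queen, which is why the nearest-word search matters: the analogy lands near, not exactly on, the answer.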
01:50:54.380 --> 01:50:58.360 And this idea of representing words as vectors turns out 01:50:58.360 --> 01:51:01.300 to be incredibly useful and powerful anytime 01:51:01.300 --> 01:51:04.420 we want to be able to do some statistical work with 01:51:04.420 --> 01:51:06.910 regards to natural language-- to be able 01:51:06.910 --> 01:51:09.350 to represent words not just as their characters, 01:51:09.350 --> 01:51:12.280 but to represent them as numbers, numbers that say something 01:51:12.280 --> 01:51:14.910 or mean something about the words themselves, 01:51:14.910 --> 01:51:18.250 and somehow relate the meaning of a word to other words that 01:51:18.250 --> 01:51:19.920 might happen to exist-- 01:51:19.920 --> 01:51:23.020 so many tools then for being able to work inside 01:51:23.020 --> 01:51:24.910 of this world of natural language. 01:51:24.910 --> 01:51:26.417 Natural language is tricky. 01:51:26.417 --> 01:51:29.500 We have to deal with the syntax of language and the semantics of language, 01:51:29.500 --> 01:51:33.100 but we've really just seen just the beginning of some of the ideas that are 01:51:33.100 --> 01:51:37.450 underlying a lot of natural language processing-- the ability to take text, 01:51:37.450 --> 01:51:40.270 extract information out of it, get some sort of meaning out of it, 01:51:40.270 --> 01:51:43.990 generate sentences maybe by having some knowledge of the grammar or maybe just 01:51:43.990 --> 01:51:47.380 by looking at probabilities of what words are likely to show up based 01:51:47.380 --> 01:51:49.780 on other words that have shown up previously-- 01:51:49.780 --> 01:51:52.300 and then finally, the ability to take words 01:51:52.300 --> 01:51:55.330 and come up with some distributed representation of them, to take words 01:51:55.330 --> 01:51:58.240 and represent them as numbers, and use those numbers 01:51:58.240 --> 01:52:02.210 to be able to say something meaningful about those words as well. 
01:52:02.210 --> 01:52:04.390 So this then is yet another topic in this broader 01:52:04.390 --> 01:52:06.300 heading of artificial intelligence. 01:52:06.300 --> 01:52:08.380 And just as I look back at where we've been now, 01:52:08.380 --> 01:52:11.320 we started our conversation by talking about the world of search, 01:52:11.320 --> 01:52:14.590 about trying to solve problems like tic-tac-toe by searching 01:52:14.590 --> 01:52:17.500 for a solution, by exploring our various different possibilities 01:52:17.500 --> 01:52:21.220 and looking at what algorithms we can apply to be able to efficiently 01:52:21.220 --> 01:52:22.300 try and search a space. 01:52:22.300 --> 01:52:25.930 We looked at some simple algorithms and then looked at some optimizations 01:52:25.930 --> 01:52:28.780 we could make to those algorithms, and ultimately, that 01:52:28.780 --> 01:52:31.742 was in service of trying to get our AI to know things about the world. 01:52:31.742 --> 01:52:34.450 And this has been a lot of what we've talked about today as well, 01:52:34.450 --> 01:52:37.270 trying to get knowledge out of text-based information, 01:52:37.270 --> 01:52:41.440 the ability to take information and draw conclusions based on that information. 01:52:41.440 --> 01:52:43.630 If I know these two things for certain, maybe I 01:52:43.630 --> 01:52:46.660 can draw a third conclusion as well. 01:52:46.660 --> 01:52:49.330 That then was related to the idea of uncertainty. 01:52:49.330 --> 01:52:51.460 If we don't know something for sure, can we 01:52:51.460 --> 01:52:54.420 predict something, figure out the probabilities of something? 01:52:54.420 --> 01:52:56.170 And we saw that again today in the context 01:52:56.170 --> 01:52:59.200 of trying to predict whether a tweet or whether a message 01:52:59.200 --> 01:53:01.420 is positive sentiment or negative sentiment, 01:53:01.420 --> 01:53:04.022 and trying to draw that conclusion as well. 
01:53:04.022 --> 01:53:05.980 Then we took a look at optimization-- the sorts 01:53:05.980 --> 01:53:09.490 of problems where we're looking for a global or local maximum 01:53:09.490 --> 01:53:10.300 or minimum. 01:53:10.300 --> 01:53:13.420 This has come up time and time again, especially most recently 01:53:13.420 --> 01:53:16.750 in the context of neural networks, which are really just a kind of optimization 01:53:16.750 --> 01:53:20.110 problem where we're trying to minimize the total amount of loss 01:53:20.110 --> 01:53:23.110 based on the setting of the weights of our neural network, 01:53:23.110 --> 01:53:26.710 based on the setting of what vector representations for words we 01:53:26.710 --> 01:53:27.880 happen to choose. 01:53:27.880 --> 01:53:30.430 And those ultimately helped us to be able to solve 01:53:30.430 --> 01:53:33.940 learning-related problems-- the ability to take a whole bunch of data, 01:53:33.940 --> 01:53:37.650 and rather than us tell the AI exactly what to do, 01:53:37.650 --> 01:53:40.030 let the AI learn patterns from the data for itself. 01:53:40.030 --> 01:53:43.770 Let it figure out what makes an inbox message different from a spam message. 01:53:43.770 --> 01:53:45.520 Let it figure out what makes a counterfeit 01:53:45.520 --> 01:53:47.560 bill different from an authentic bill, and being 01:53:47.560 --> 01:53:49.820 able to draw that analysis as well. 01:53:49.820 --> 01:53:52.390 And one of the big tools in learning that we used 01:53:52.390 --> 01:53:54.220 were neural networks, these structures that 01:53:54.220 --> 01:53:58.180 allow us to relate inputs to outputs by training these internal networks 01:53:58.180 --> 01:54:02.410 to learn some sort of function that maps us from some input to some output-- 01:54:02.410 --> 01:54:05.770 ultimately yet another model in this language of artificial intelligence 01:54:05.770 --> 01:54:08.320 that we can use to communicate with our AI. 
01:54:08.320 --> 01:54:10.210 Then finally today, we looked at some ways 01:54:10.210 --> 01:54:12.850 that AI can begin to communicate with us, looking at ways 01:54:12.850 --> 01:54:16.240 that AI can begin to get an understanding for the syntax 01:54:16.240 --> 01:54:19.990 and the semantics of language to be able to generate sentences, 01:54:19.990 --> 01:54:23.110 to be able to predict things about text that's written in a spoken 01:54:23.110 --> 01:54:25.360 language or a written language like English, 01:54:25.360 --> 01:54:27.927 and to be able to do interesting analysis there as well. 01:54:27.927 --> 01:54:30.010 And there's so much more in active research that's 01:54:30.010 --> 01:54:33.160 happening all over the areas within artificial intelligence today, 01:54:33.160 --> 01:54:36.890 and we've really only just seen the beginning of what AI has to offer. 01:54:36.890 --> 01:54:39.310 So I hope you enjoyed this exploration into this world 01:54:39.310 --> 01:54:41.235 of artificial intelligence with Python. 01:54:41.235 --> 01:54:44.110 A big thank you to the course's teaching staff and the production team 01:54:44.110 --> 01:54:45.700 for making this class possible. 01:54:45.700 --> 01:54:49.940 This was an Introduction to Artificial Intelligence with Python.